# Homework 6

## CSCI E-82A


In the Dynamic Programming (DP) lesson Jupyter notebook, we constructed a representation of a simple grid world. DP was used to find optimal plans for a robot to navigate from any starting location on the grid to the goal. This problem is an analog for more complex real-world robot navigation problems. 

In this homework you will use DP to solve a slightly more complex robotic navigation problem in a grid world. This grid world is a simple version of the problem a material transport robot might encounter in a warehouse. The situation is illustrated in the figure below.

<img src="GridWorldFactory.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> Grid World for Factory Navigation Example </center>

The goal is for the robot to deliver some material to position (state) 12, shown in blue. Since there is a goal state or **terminal state**, this is an **episodic task**.

There are some barriers comprised of the states $\{ 6, 7, 8 \}$ and $\{ 16, 17, 18 \}$, shown with hash marks. In a real warehouse, these positions might be occupied by shelving or equipment. We do not want the robot to hit these barriers. Thus, we say that transitioning to these barrier states is **taboo**.

As before, we do not want the robot to hit the edges of the grid world, which represent the outer walls of the warehouse. 



## Representation

As with many such problems, the starting place is creating the **representation**. In the cell below, encode your representation of the possible state-action transitions. From each state there are 4 possible actions:
- up, u
- down, d
- left, l
- right, r

There are a few special cases you need to consider:
- Any action transitioning to a state off the grid or into a barrier should keep the state unchanged. 
- Once in the goal state there are no more state transitions. 
- Any transition within the barrier (taboo) states can keep the state unchanged. If you experiment, you will see that other encodings work as well, since the values of the barrier states are always zero and no actions transition into these states.

> **Hint:** It may help to create a pencil-and-paper sketch of the transitions, rewards, and probabilities or policy. This can help you keep the bookkeeping correct.

In the cell below define a dictionary where your code can look up the successor state given the current state and action. 

In [None]:
## import numpy for later
import numpy as np
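
One way to build the successor-state dictionary described above is to generate it from the grid geometry instead of typing every entry by hand. The sketch below assumes a 5x5 grid with states numbered 0 to 24 row by row from the top-left corner; that layout is an assumption, so check it against the figure. The names `build_transitions`, `GOAL`, `TABOO`, and `MOVES` are illustrative and do not come from the DP notebook.

```python
GRID_SIZE = 5                     # assumed 5x5 layout, states 0..24 row by row
GOAL = 12
TABOO = {6, 7, 8, 16, 17, 18}
MOVES = {'u': (-1, 0), 'd': (1, 0), 'l': (0, -1), 'r': (0, 1)}

def build_transitions(n=GRID_SIZE, goal=GOAL, taboo=TABOO):
    """Return {state: {action: successor_state}} for the grid world."""
    transitions = {}
    for state in range(n * n):
        row, col = divmod(state, n)
        transitions[state] = {}
        for action, (d_row, d_col) in MOVES.items():
            new_row, new_col = row + d_row, col + d_col
            on_grid = 0 <= new_row < n and 0 <= new_col < n
            successor = new_row * n + new_col
            # Stay put when the move leaves the grid or hits a barrier, and
            # when the current state is the goal (terminal) or a taboo state.
            if not on_grid or successor in taboo or state == goal or state in taboo:
                successor = state
            transitions[state][action] = successor
    return transitions

neighbors = build_transitions()
print(neighbors[0])    # a corner state
print(neighbors[11])   # the state immediately left of the goal
```

Generating the table this way keeps the special cases (grid edges, barriers, goal) in one place, which makes them easier to check against a hand-drawn sketch.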



You need to define the initial policy for the Markov process. Set the probabilities for each transition as a **uniform distribution** leading to random action by the robot. In the subsequent sections of this notebook you will improve this policy. 

> **Note:** As these are just starting values, the exact values of the transition probabilities are not actually all that important in terms of solving the DP problem. Also, notice that it does not matter how the taboo state transitions are encoded. The point of the DP algorithm is to learn the transition policy.  

In the cell below, define a dictionary with the initial policy. 
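
A uniform initial policy can be written as a nested dictionary that maps each state to a probability for each action. This is a minimal sketch, assuming 25 states (a 5x5 grid) and the four single-letter action codes listed earlier; `uniform_policy` is a hypothetical helper name.

```python
ACTIONS = ('u', 'd', 'l', 'r')
N_STATES = 25   # assumed 5x5 grid

def uniform_policy(n_states=N_STATES, actions=ACTIONS):
    """Return {state: {action: probability}} with equal probability per action."""
    prob = 1.0 / len(actions)
    return {state: {action: prob for action in actions}
            for state in range(n_states)}

initial_policy = uniform_policy()
print(initial_policy[0])
```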

The robot receives the following rewards:
- +10 for achieving the goal. 
- -1 for attempting to leave the warehouse or hitting the barriers. In other words, we penalize the robot for hitting the edges of the grid or the barriers.  
- -0.1 for all other state transitions, which is the cost for the robot to move from one state to another.  

In the code cell below encode a dictionary with your representation of this reward structure.  
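
The reward rules above can be encoded with the same `[state][action]` indexing as the transitions. The sketch below again assumes a 5x5 grid numbered row by row from the top-left corner. It also assumes a reward of 0 for any action taken from the goal or a taboo state, which the assignment does not specify. `build_rewards` is a hypothetical name.

```python
GRID_SIZE, GOAL, TABOO = 5, 12, {6, 7, 8, 16, 17, 18}   # assumed 5x5 layout
MOVES = {'u': (-1, 0), 'd': (1, 0), 'l': (0, -1), 'r': (0, 1)}

def build_rewards(n=GRID_SIZE, goal=GOAL, taboo=TABOO):
    """Return {state: {action: reward}} for the grid world."""
    rewards = {}
    for state in range(n * n):
        row, col = divmod(state, n)
        rewards[state] = {}
        for action, (d_row, d_col) in MOVES.items():
            new_row, new_col = row + d_row, col + d_col
            on_grid = 0 <= new_row < n and 0 <= new_col < n
            target = new_row * n + new_col
            if state == goal or state in taboo:
                reward = 0.0     # assumption: nothing is earned from these states
            elif not on_grid or target in taboo:
                reward = -1.0    # hit a wall or a barrier
            elif target == goal:
                reward = 10.0    # reached the goal
            else:
                reward = -0.1    # ordinary move
            rewards[state][action] = reward
    return rewards

reward_table = build_rewards()
print(reward_table[11])   # the state immediately left of the goal
```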

You will find it useful to create a list of taboo states, which you can encode in the cell below.

## Policy Evaluation

With your representation defined, you can now create and test a function for **policy evaluation**. You will need this function for your policy iteration code. 

You are welcome to start with the `compute_state_value` function from the DP notebook. However, keep in mind that you must modify this code to correctly treat the taboo states. Specifically, taboo states should have 0 value. 
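
As a point of comparison, here is a minimal sketch of iterative policy evaluation that holds the goal and taboo states at zero. It is not the notebook's `compute_state_value` function. It assumes a 5x5 grid numbered row by row, a discount factor of 0.9, and zero reward for actions taken from the goal or taboo states; none of these are fixed by the assignment. The names `make_grid_world` and `evaluate_policy` are hypothetical.

```python
import numpy as np

GOAL, TABOO = 12, {6, 7, 8, 16, 17, 18}
MOVES = {'u': (-1, 0), 'd': (1, 0), 'l': (0, -1), 'r': (0, 1)}

def make_grid_world(n=5):
    """Return (transitions, rewards), each indexed as [state][action]; n=5 is assumed."""
    transitions, rewards = {}, {}
    for s in range(n * n):
        row, col = divmod(s, n)
        transitions[s], rewards[s] = {}, {}
        for a, (dr, dc) in MOVES.items():
            r2, c2 = row + dr, col + dc
            nxt = r2 * n + c2
            blocked = not (0 <= r2 < n and 0 <= c2 < n) or nxt in TABOO
            if s == GOAL or s in TABOO:
                transitions[s][a], rewards[s][a] = s, 0.0   # assumed zero reward
            elif blocked:
                transitions[s][a], rewards[s][a] = s, -1.0
            else:
                transitions[s][a] = nxt
                rewards[s][a] = 10.0 if nxt == GOAL else -0.1
    return transitions, rewards

def evaluate_policy(policy, transitions, rewards, gamma=0.9, theta=1e-6):
    """Sweep the Bellman expectation update in place until the largest change < theta."""
    values = np.zeros(len(transitions))
    frozen = TABOO | {GOAL}          # these states keep a value of zero
    while True:
        delta = 0.0
        for s in transitions:
            if s in frozen:
                continue
            new_value = sum(prob * (rewards[s][a] + gamma * values[transitions[s][a]])
                            for a, prob in policy[s].items())
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < theta:
            return values

transitions, rewards = make_grid_world()
random_policy = {s: {a: 0.25 for a in MOVES} for s in transitions}
state_values = evaluate_policy(random_policy, transitions, rewards)
print(np.round(state_values.reshape(5, 5), 2))
```

Skipping the frozen states inside the sweep is one simple way to keep their values at zero; resetting them after each sweep would work as well.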

Examine the state values you have computed using a random walk for the robot. Answer the following questions:

1. Are the values of the goal and taboo states zero? 
2. Do the values of the states increase closer to the goal? 
3. Do the goal and barrier states all have zero values? 

ANS 1:    
ANS 2:    
ANS 3:    

## Policy Iteration

Now that you have your representation and a function to perform policy evaluation, you have everything you need to apply the policy improvement algorithm to create an optimal policy for the robot to reach the goal.

If your policy evaluation function works correctly, you should be able to use the `policy_iteration` function from the DP notebook with minor modifications. Make sure you print the state values as well as the policy you discovered.
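
For orientation, the loop of evaluation followed by greedy improvement might look like the sketch below. This is not the notebook's `policy_iteration` function. It assumes a 5x5 grid numbered row by row, a discount factor of 0.9, and zero reward from the goal and taboo states. It also makes one design choice of its own: actions that tie for the best value share the probability equally. All function names are hypothetical.

```python
import numpy as np

GOAL, TABOO = 12, {6, 7, 8, 16, 17, 18}
FROZEN = TABOO | {GOAL}              # states whose value is held at zero
MOVES = {'u': (-1, 0), 'd': (1, 0), 'l': (0, -1), 'r': (0, 1)}

def make_grid_world(n=5):
    """Return (transitions, rewards), each indexed as [state][action]; n=5 is assumed."""
    transitions, rewards = {}, {}
    for s in range(n * n):
        row, col = divmod(s, n)
        transitions[s], rewards[s] = {}, {}
        for a, (dr, dc) in MOVES.items():
            r2, c2 = row + dr, col + dc
            nxt = r2 * n + c2
            blocked = not (0 <= r2 < n and 0 <= c2 < n) or nxt in TABOO
            if s in FROZEN:
                transitions[s][a], rewards[s][a] = s, 0.0   # assumed zero reward
            elif blocked:
                transitions[s][a], rewards[s][a] = s, -1.0
            else:
                transitions[s][a] = nxt
                rewards[s][a] = 10.0 if nxt == GOAL else -0.1
    return transitions, rewards

def evaluate_policy(policy, transitions, rewards, gamma, theta=1e-8):
    """Iterative policy evaluation with in-place updates."""
    values = np.zeros(len(transitions))
    while True:
        delta = 0.0
        for s in transitions:
            if s in FROZEN:
                continue
            v = sum(p * (rewards[s][a] + gamma * values[transitions[s][a]])
                    for a, p in policy[s].items())
            delta = max(delta, abs(v - values[s]))
            values[s] = v
        if delta < theta:
            return values

def improve_policy(policy, values, transitions, rewards, gamma, tie_tol=1e-6):
    """Act greedily on the current values; tied actions share probability."""
    new_policy = {}
    for s in transitions:
        if s in FROZEN:
            new_policy[s] = dict(policy[s])   # left unchanged; never used
            continue
        q = {a: rewards[s][a] + gamma * values[transitions[s][a]] for a in MOVES}
        best = max(q.values())
        winners = [a for a in MOVES if q[a] >= best - tie_tol]
        new_policy[s] = {a: (1.0 / len(winners) if a in winners else 0.0)
                         for a in MOVES}
    return new_policy

def policy_iteration_sketch(gamma=0.9, max_sweeps=50):
    transitions, rewards = make_grid_world()
    policy = {s: {a: 0.25 for a in MOVES} for s in transitions}
    for _ in range(max_sweeps):
        values = evaluate_policy(policy, transitions, rewards, gamma)
        new_policy = improve_policy(policy, values, transitions, rewards, gamma)
        if new_policy == policy:      # policy is stable, so stop
            break
        policy = new_policy
    return values, policy

optimal_values, optimal_policy = policy_iteration_sketch()
print(np.round(optimal_values.reshape(5, 5), 3))
print(optimal_policy[2])
```

Sharing probability among tied actions preserves the left-right symmetry of the grid in the resulting policy, which can make the symmetry questions easier to check.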

Examine your results. First look at the state values at convergence of the policy iteration algorithm and answer the following questions:
1. Are non-taboo state values closest to the goal the largest? 
2. Are the non-taboo state values farthest from the goal the smallest? Keep in mind the robot must travel around the barrier. 
3. Are the non-taboo state values symmetric (i.e., the same) with respect to distance from the goal? 
4. Do the taboo states have 0 values? 
5. How do the state values of the improved policy compare to the state values of the initial policy?


ANS 1:     
ANS 2:  
ANS 3:   
ANS 4:  
ANS 5: 

Next, examine the policy you have computed. Do the following:
- Follow the optimal paths from the 4 corners of the grid to the goal. Do the symmetry and length of these paths make sense given the distance to the goal and the state values? 

ANS: 

- Imagine that the door for the warehouse is at position (state) 2. Insert an illustration showing the paths of the optimal plans below. You are welcome to start with the PowerPoint illustration in the course GitHub repository.  

**Insert your image here**    
 
<img src="PlanOnGridWorld.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> Illustration of the grid world optimal plans from state 2 to the goal goes here </center>


## Value Iteration 

Finally, use the value iteration algorithm to compute an optimal policy for the robot reaching the goal. Keep in mind that you will need to maintain a state value of 0 for the taboo states. You may use the `value_iteration` function from the DP notebook with minor modifications.
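
A compact sketch of value iteration with the taboo and goal states held at zero is shown here for orientation. It is not the notebook's `value_iteration` function. It assumes a 5x5 grid numbered row by row, a discount factor of 0.9, and zero reward from the goal and taboo states. The names `make_grid_world` and `value_iteration_sketch` are hypothetical.

```python
import numpy as np

GOAL, TABOO = 12, {6, 7, 8, 16, 17, 18}
FROZEN = TABOO | {GOAL}              # states whose value is held at zero
MOVES = {'u': (-1, 0), 'd': (1, 0), 'l': (0, -1), 'r': (0, 1)}

def make_grid_world(n=5):
    """Return (transitions, rewards), each indexed as [state][action]; n=5 is assumed."""
    transitions, rewards = {}, {}
    for s in range(n * n):
        row, col = divmod(s, n)
        transitions[s], rewards[s] = {}, {}
        for a, (dr, dc) in MOVES.items():
            r2, c2 = row + dr, col + dc
            nxt = r2 * n + c2
            blocked = not (0 <= r2 < n and 0 <= c2 < n) or nxt in TABOO
            if s in FROZEN:
                transitions[s][a], rewards[s][a] = s, 0.0   # assumed zero reward
            elif blocked:
                transitions[s][a], rewards[s][a] = s, -1.0
            else:
                transitions[s][a] = nxt
                rewards[s][a] = 10.0 if nxt == GOAL else -0.1
    return transitions, rewards

def value_iteration_sketch(gamma=0.9, theta=1e-8):
    transitions, rewards = make_grid_world()
    values = np.zeros(len(transitions))

    def q(s, a):
        """Value of taking action a in state s under the current estimates."""
        return rewards[s][a] + gamma * values[transitions[s][a]]

    while True:
        delta = 0.0
        for s in transitions:
            if s in FROZEN:
                continue
            best = max(q(s, a) for a in MOVES)   # Bellman optimality update
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < theta:
            break
    # Read off one greedy action per non-frozen state; ties go to the
    # first action in the order u, d, l, r.
    plan = {s: max(MOVES, key=lambda a: q(s, a))
            for s in transitions if s not in FROZEN}
    return values, plan

vi_values, vi_plan = value_iteration_sketch()
print(np.round(vi_values.reshape(5, 5), 3))
print(vi_plan[0], vi_plan[10], vi_plan[11])
```

Because this sketch returns a single action per state, a state with two equally good actions shows only one of them; keep that in mind when comparing plans between methods.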

Compare your results from the value iteration algorithm to your results from the policy iteration algorithm and answer the following questions:   
1. Are the state values identical between the two methods?    
2. Ignoring the taboo states, are the plans computed by the two methods identical?    
3. Ignoring the taboo states, are the final state values computed from both methods the same? 
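
When comparing two sets of state values, a tolerance-based check that can skip selected states is usually more informative than exact equality, since iterative methods stop at slightly different floating-point values. The helper below is a hypothetical sketch; the demo numbers are made up to exercise it and are not homework results.

```python
import numpy as np

def compare_state_values(values_a, values_b, ignore_states=(), atol=1e-4):
    """Return (match, max_gap) over all states not listed in ignore_states."""
    values_a = np.asarray(values_a, dtype=float)
    values_b = np.asarray(values_b, dtype=float)
    keep = np.ones(values_a.shape[0], dtype=bool)
    keep[np.asarray(list(ignore_states), dtype=int)] = False
    max_gap = float(np.max(np.abs(values_a - values_b)[keep]))
    return max_gap <= atol, max_gap

# Made-up numbers purely to exercise the helper.
demo_a = [1.0, 2.0, 0.0, 4.0]
demo_b = [1.0, 2.0, 9.9, 4.00001]
print(compare_state_values(demo_a, demo_b))
print(compare_state_values(demo_a, demo_b, ignore_states=[2]))
```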

ANS 1:        
ANS 2:      
ANS 3:   