# Dynamic Programming
---
In this notebook, you will implement a Dynamic Programming algorithm.

<a target="_blank" href="https://colab.research.google.com/github/PrzemekSekula/ReinforcementLearningClasses/blob/master/DynamicProgramming/DP_empty.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>

In [None]:
import sys
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !wget https://raw.githubusercontent.com/PrzemekSekula/ReinforcementLearningClasses/main/DynamicProgramming/helper.py

In [None]:
import numpy as np

from helper import State, Simulator

%matplotlib inline

# World description
### Actions
Our world is a grid world. An agent can move around the world in four main directions, so four actions are possible:
- 0 (up)
- 1 (right)
- 2 (down)
- 3 (left)

Each action moves the agent to the neighboring state in the corresponding direction. If the move is impossible, the agent stays in the same state. There is also a terminal state (marked red). Reaching the terminal state ends the episode.
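The movement rule can be sketched as follows. This is a minimal stand-alone illustration, not the code from `helper.py`: the `MOVES` table and the `step` function are hypothetical names, and the sketch assumes that the only impossible moves are those that would leave the grid.

```python
# Hypothetical stand-alone sketch of the movement rule (not the helper.py code).
# Offsets (d_row, d_col) for actions 0 (up), 1 (right), 2 (down), 3 (left).
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def step(state, action, shape):
    """Return the (row, col) reached from `state`; stay put if the move leaves the grid."""
    row, col = state
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    n_rows, n_cols = shape
    if 0 <= new_row < n_rows and 0 <= new_col < n_cols:
        return (new_row, new_col)
    return (row, col)


print(step((0, 0), 1, (5, 4)))  # moves one column to the right
print(step((0, 0), 0, (5, 4)))  # blocked by the upper edge, stays in place
```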

### States
A state is the location of the agent. The standard way to describe a state is with an instance of the `State` class. Such an object has two properties, `state.row` and `state.col`, that describe the position of the agent. Rows and columns are both counted from `0`, so the upper left corner corresponds to state `(0, 0)`. A state can also be described by a tuple `(row, col)`. The methods implemented in the simulator accept this format as well.

### Rewards
Each move is granted a negative reward, `Reward = -1`. Additionally, entering a state grants the reward associated with that state.
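As a worked example, the sketch below adds the two rewards for a single move. It assumes the step reward and the state reward are simply summed, which is what the description suggests; the actual computation lives in `helper.py`. The `transition_reward` function and `STEP_REWARD` constant are hypothetical names, and the array uses the same layout as the first world defined later in this notebook.

```python
import numpy as np

# Rewards for entering each state (illustrative copy of the first world).
world = np.array([
    [  0,   0,   0,   0],
    [-10, -10, -15,   0],
    [  0,   0,   0,   0],
    [  0, -10, -10, -10],
    [  0,   0,   0,   0],
])
STEP_REWARD = -1  # granted for every move


def transition_reward(next_state):
    """Assumed total reward for a move that ends in `next_state` = (row, col)."""
    row, col = next_state
    return STEP_REWARD + int(world[row, col])


print(transition_reward((1, 2)))  # step reward plus the -15 state reward
print(transition_reward((0, 1)))  # step reward only, the state reward is 0
```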



# Simulator Description
The main goals of the simulator are as follows:
- store the data about the world
- store the current policy
- store the current value function
- facilitate RL-related operations

#### Properties:
- `world` - numpy.array with the world. The numbers correspond to the rewards for reaching each state
- `policy` - numpy.array with policy. Policy is always deterministic. The numbers represent specific actions: 0 (up), 1 (right), 2 (down), 3 (left)
- `values` - numpy.array with the state-value function for each state
- `reward` - additional reward granted for performing each action
- `terminal` - a terminal state. It is an instance of the `State` class. 
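To make the array representation concrete, here is a small stand-alone sketch of a deterministic policy and a value function stored as arrays. It assumes both arrays have the same shape as `world`, and it uses a tiny made-up 2x2 world. `ACTION_NAMES` is a hypothetical helper introduced only for printing.

```python
import numpy as np

# A tiny made-up world, used only for illustration.
world = np.array([
    [  0, 0],
    [-10, 0],
])

# One action per state: here every state is set to 3 (left).
policy = np.full(world.shape, 3)
# One value per state, initialised to zero.
values = np.zeros(world.shape)

ACTION_NAMES = {0: "up", 1: "right", 2: "down", 3: "left"}
row, col = 1, 0
print(ACTION_NAMES[int(policy[row, col])], values[row, col])
```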

#### Methods:
- `move` - Returns a state that is the result of an action.
- `getReward` - Returns the reward for entering a specific state (location).
- `getValue` - Returns the value function for a given state (location).
- `getPolicy` - Returns the policy for a given state (location).
- `setValue` - Sets the value function for a given state (location).
- `setPolicy` - Sets the policy for a given state (location).
- `plot` - Visualizes the world, value function and policy.



In [None]:
world = np.array([
    [  0,   0,   0,   0],
    [-10, -10, -15,   0],
    [  0,   0,   0,   0],
    [  0, -10, -10, -10],
    [  0,   0,   0,   0],
])



sim = Simulator(
    world = world, # Our World
    terminal = [x-1 for x in world.shape], # terminal state in lower right corner
    reward = -1 # Reward for each step
    )

sim.policy = 3 + sim.world * 0
sim.plot()

#### Reward, Value, Policy
We can communicate with the world by reading rewards, value functions, and policies.
Let's see how it works.

In [None]:
# Let's start with state (location) (1, 2): row 1, column 2



##### Setting values and policies.

#### Using `State` class

#### Using `simulator.move()` method
There are two ways of using the `simulator.move()` method. In the first version, you provide two arguments:
- `state` - the starting location that you want to move from. It may be either an instance of the `State` class or a tuple with (row, col) coordinates
- `action` - action to be performed. One of: 0 (up), 1 (right), 2 (down), 3 (left)

The method returns the destination state (an instance of the `State` class).

In the second version, you provide only the starting location (state) that you want to move from. The action is then selected from the current policy.
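The two calling conventions can be mimicked with an optional argument. The sketch below is a hypothetical stand-in, not the real `Simulator.move()`: it works on plain `(row, col)` tuples instead of `State` objects, keeps the policy in a module-level array, and assumes the only blocked moves are those leaving the grid.

```python
import numpy as np

# Offsets (d_row, d_col) for actions 0 (up), 1 (right), 2 (down), 3 (left).
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
policy = np.full((5, 4), 3)  # illustrative policy: action 3 (left) everywhere


def move(state, action=None):
    """Stand-in for the two versions: explicit action, or action read from the policy."""
    row, col = state
    if action is None:
        action = int(policy[row, col])  # second version: follow the current policy
    d_row, d_col = MOVES[action]
    # Clamping to the grid is the same as staying in place for one-cell moves.
    new_row = min(max(row + d_row, 0), policy.shape[0] - 1)
    new_col = min(max(col + d_col, 0), policy.shape[1] - 1)
    return (new_row, new_col)


print(move((1, 2), 2))  # first version: explicit action 2 (down)
print(move((1, 2)))     # second version: the policy says 3 (left)
```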

# TO DO
Write code that uses one of the DP algorithms to find an optimal policy for the given world.
To pass the assignment you have to:
- show that your code works
- understand the code
- understand the theory behind the code
- understand the general idea of dynamic programming.

*Note 1: It is not necessary to use the simulator, but you should at least consider it. It will make your life much easier.*

*Note 2: You may use any discount rate you wish, but I recommend discounting your rewards (use a value < 1, e.g. 0.9).*
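To see what the discount rate does, the sketch below computes the discounted return of a short, made-up reward sequence, where the k-th reward is weighted by `gamma**k`. The `discounted_return` function is a hypothetical helper and is not part of `helper.py`.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th reward is weighted by gamma**k."""
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [-1, -1, -11]  # made-up sequence: two plain steps, then a penalised state
print(discounted_return(rewards, 0.9))  # later rewards count less
print(discounted_return(rewards, 1.0))  # no discounting: plain sum
```

With a discount rate below 1, a penalty that lies further in the future weighs less than the same penalty incurred immediately.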

In [None]:
world = -np.array([
    [ 0,  0,  0,  0,  0,  0],
    [10, 10, 11, 10, 10,  0],
    [ 0,  0,  0,  0,  0,  0],
    [ 0, 10, 10,  4, 10, 10],
    [ 0,  0,  0,  0,  0,  0],
    [10, 10, 10, 10, 10,  0],    
    [ 0,  0,  0,  0,  0,  0],
    [ 0, 10, 10, 10, 10, 10],
    [ 0,  0,  0,  0,  0,  0],
])


sim = Simulator(
    world = world, # Our World
    terminal = [x-1 for x in world.shape], # terminal state in lower right corner
    reward = -1 # Reward for each step
    )

sim.plot()

In [None]:
gamma = 0.9