# An Overview 

We will establish the state space, or we can think of this as the environment in which the agent will be trained. When an agent is in a state, it chooses an action based on a policy. There are 2 ways to set up policies:
+ Deterministic: in each state, the value of the actions is specifically defined.
+ Stochastic: in each state, the values of the action and state form a probability space.

In this example, we will use a stochastic policy setting. At each state, the search space for the action will be evaluated according to a random function, we will take the largest value in that range to choose the behavior.

Note: in this example, the main purpose is to describe the definitions in the environment and agent settings in reinforcement learning.

In [1]:
import numpy as np

In [4]:
n_states = 4

env = np.arange(n_states*n_states).reshape((n_states, n_states))
env

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [13]:
n_actions = 4
q_table = np.random.rand(n_states, n_actions)
q_table

array([[0.0457407 , 0.23235773, 0.82253313, 0.0722362 ],
       [0.7508166 , 0.74588432, 0.32160587, 0.34907592],
       [0.91292341, 0.20002638, 0.81480425, 0.94573611],
       [0.09645007, 0.2922429 , 0.61120022, 0.6111294 ]])

In [20]:
state = 0
discount_factor = 0.01
learning_rate = 0.9

while True: 
    # To make sure that when the next state is bounded
    if state + 1 >= n_states: 
        break

    # From the current state, we would choose the action that has the highest score
    action_from_state = q_table[state]
    action = np.argmax(action_from_state)

    currrent_q_value = q_table[state, action]
    reward = env[state, action]
    next_action = np.max(q_table[state+1])
    target_q_value = reward + discount_factor*next_action

    q_table[state, action] = learning_rate* (target_q_value - currrent_q_value)

    state+= 1


In [21]:
q_table

array([[0.0457407 , 0.23235773, 1.05763179, 0.0722362 ],
       [1.26470457, 0.74588432, 0.32160587, 0.34907592],
       [0.91292341, 0.20002638, 0.81480425, 2.4133931 ],
       [0.09645007, 0.2922429 , 0.61120022, 0.6111294 ]])