# <FONT COLOR="red">***RL ALGORITHMS***</FONT>
---
---

Reinforcement learning (RL) is a branch of machine learning in which agents learn to make optimal decisions by interacting with an environment. These techniques apply to situations in which an agent takes actions in an environment to maximize a cumulative reward over time.

## <FONT COLOR="orange">**Introduction to the main RL algorithms**</FONT>
---
---

Generate a simple game environment where an agent must learn to navigate and collect items. Define the states, actions, and rewards for the agent.

In [48]:
# ENVIRONMENT CLASS
class Environment:
  # CONSTRUCTOR
  def __init__(self):
    self._state_space = [0, 1, 2, 3]              # Possible states (3 is the goal state)
    self._action_space = [0, 1]                   # Possible actions (0: stay, 1: move right)
    self._rewards = {0: -1, 1: -1, 2: -1, 3: 10}  # Reward for entering each state

  # GETTERS
  @property
  def get_state_space(self):
    return self._state_space

  @property
  def get_action_space(self):
    return self._action_space

  @property
  def get_rewards(self):
    return self._rewards

In [49]:
# ENVIRONMENT INSTANCE
env = Environment()

In [50]:
# TEST IMPLEMENTATION
print(f'States: {env.get_state_space}')
print(f'Actions: {env.get_action_space}')
print(f'Rewards: {env.get_rewards}')

States: [0, 1, 2, 3]
Actions: [0, 1]
Rewards: {0: -1, 1: -1, 2: -1, 3: 10}
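
The class above stores states, actions, and rewards, but the transition rule only appears later inside the training loops (`next_state = state + action`). A minimal sketch that makes the rule explicit is shown below. The names `step`, `REWARDS`, and `GOAL_STATE` are hypothetical helpers and are not part of the original notebook; the `min(...)` guard is an added safety assumption.

```python
# Hypothetical helper (not in the original notebook): one environment transition.
REWARDS = {0: -1, 1: -1, 2: -1, 3: 10}  # Same values as Environment._rewards
GOAL_STATE = 3                           # Assumed terminal state

def step(state, action):
    """Return (next_state, reward, done) for a single move."""
    next_state = min(state + action, GOAL_STATE)  # Action 0 stays, action 1 moves right
    reward = REWARDS[next_state]                  # Reward depends on the state entered
    done = next_state == GOAL_STATE               # Episode ends at the goal state
    return next_state, reward, done

print(step(0, 1))  # (1, -1, False)
print(step(2, 1))  # (3, 10, True)
```

Returning a `done` flag keeps the termination test next to the dynamics instead of repeating `state != 3` in every loop.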


## <FONT COLOR="orange">**EXERCISE 2**</FONT>
---
---

Implement the Q-Learning algorithm so that an agent learns to navigate and collect objects in the defined environment. Show how the Q-value function is updated.
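
For reference, the standard one-step Q-Learning update is written below, where $s'$ is the next state, $r$ the reward received, $\alpha$ the learning rate, and $\gamma$ the discount factor:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$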

In [51]:
# IMPORT COMMON LIBRARIES
import numpy as np

In [52]:
# INITIALIZE THE Q-TABLE
Q = np.zeros((len(env.get_state_space), len(env.get_action_space)))

In [53]:
# INITIALIZE THE ALGORITHM PARAMETERS
alpha = 0.1   # Learning rate
gamma = 0.9   # Discount factor

In [54]:
# TRAIN THE AGENT USING Q-LEARNING
for _ in range(1000):
  state = np.random.choice(env.get_state_space)     # Random initial state
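  while state != 3:                                 # Run until the goal state is reached
    action = np.random.choice(env.get_action_space) # Select random action (pure exploration)
    next_state = state + action                     # Action 0 stays, action 1 moves right
    reward = env.get_rewards[next_state]            # Reward for entering the next state
    # Q-LEARNING UPDATE: BOOTSTRAP FROM THE BEST ACTION IN THE NEXT STATE
    Q[state, action] = Q[state, action] + alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
    state = next_state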

In [55]:
# TEST IMPLEMENTATION
print(f'Q-Value function learned:\n{Q}')

Q-Value function learned:
[[ 4.58  6.2 ]
 [ 6.2   8.  ]
 [ 8.   10.  ]
 [ 0.    0.  ]]
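
As a sanity check, the learned table can be compared with the fixed point of the Bellman optimality equation for this environment. The sketch below recomputes that fixed point by repeated sweeps (value iteration on Q). It redefines the rewards and discount factor locally so that it runs on its own; `Q_star` and `greedy_policy` are names introduced here for illustration only.

```python
import numpy as np

rewards = {0: -1, 1: -1, 2: -1, 3: 10}  # Same rewards as the environment
gamma = 0.9                              # Same discount factor as above
goal = 3                                 # Terminal state, its Q-values stay at 0

Q_star = np.zeros((4, 2))
for _ in range(100):                     # Repeated sweeps until the values settle
    for state in range(goal):
        for action in (0, 1):
            next_state = state + action
            # Bellman optimality backup for a deterministic transition
            Q_star[state, action] = rewards[next_state] + gamma * np.max(Q_star[next_state])

print(np.round(Q_star, 2))

# Greedy action in each non-terminal state
greedy_policy = np.argmax(Q_star[:goal], axis=1)
print(greedy_policy)  # [1 1 1]
```

The result matches the table printed above, and the greedy policy always picks action 1 (move toward the goal).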


## <FONT COLOR="orange">**EXERCISE 3**</FONT>
---
---

Implement the SARSA algorithm and compare it with Q-Learning in the same environment. Show how the Q-value function is updated and compare the results.
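
For reference, the standard SARSA update is written below. It differs from Q-Learning only in the bootstrap term: it uses the value of the action $a'$ actually selected in the next state $s'$ instead of the maximum over actions.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]
$$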

In [56]:
# INITIALIZE Q-TABLE
Q = np.zeros((len(env.get_state_space), len(env.get_action_space)))

In [57]:
# TRAIN THE SARSA AGENT
for _ in range(1000):
  state = np.random.choice(env.get_state_space)          # Random initial state
  action = np.random.choice(env.get_action_space)        # Select random action
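  while state != 3:                                      # Run until the goal state is reached
    next_state = state + action                          # Action 0 stays, action 1 moves right
    next_action = np.random.choice(env.get_action_space) # Select random next action
    reward = env.get_rewards[next_state]                 # Reward for entering the next state
    # SARSA UPDATE: BOOTSTRAP FROM THE ACTION ACTUALLY SELECTED IN THE NEXT STATE
    Q[state, action] = Q[state, action] + alpha * (reward + gamma * Q[next_state, next_action] - Q[state, action])
    state = next_state
    action = next_action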

In [58]:
# TEST IMPLEMENTATION
print(f'Q-Value function learned with Sarsa:\n{Q}')

Q-Value function learned with Sarsa:
[[ 0.54618649  3.74783604]
 [ 3.85663046  7.22931891]
 [ 6.90560404 10.        ]
 [ 0.          0.        ]]
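
The SARSA values are lower than the Q-Learning values because SARSA is on-policy: with actions chosen uniformly at random, it estimates the value of that random policy, not of the greedy one. A minimal sketch of this comparison computes the action values of the uniform random policy by iterative policy evaluation. It redefines the constants locally, and `Q_random` is a name introduced here for illustration.

```python
import numpy as np

rewards = {0: -1, 1: -1, 2: -1, 3: 10}  # Same rewards as the environment
gamma = 0.9                              # Same discount factor as above
goal = 3                                 # Terminal state, its Q-values stay at 0

Q_random = np.zeros((4, 2))
for _ in range(500):                     # Repeated sweeps until the values settle
    for state in range(goal):
        for action in (0, 1):
            next_state = state + action
            # Expected next value when the next action is chosen uniformly at random
            expected_next = np.mean(Q_random[next_state])
            Q_random[state, action] = rewards[next_state] + gamma * expected_next

print(np.round(Q_random, 3))
```

The values come out at approximately 0.954 and 3.388 for state 0, 3.388 and 6.364 for state 1, and 6.364 and 10 for state 2. The SARSA table above fluctuates around these numbers because the learning rate is constant, while Q-Learning converges to the larger optimal values.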


## <FONT COLOR="orange">**EXERCISE 4**</FONT>
---
---

Implement a gradient-based optimization technique to learn a policy in the same environment. Show how the policy parameters are updated using gradient ascent.

In [76]:
# INITIALIZE POLICY WITH UNIFORM PROBABILITIES
policy = np.ones(
  (
    len(env.get_state_space),
    len(env.get_action_space)
  )
)/len(env.get_action_space)

In [60]:
# AVERAGE Q-VALUE OF THE CURRENT POLICY'S PREFERRED ACTIONS (USED AS A BASELINE)
def average_reward(Q):
  return np.mean(
    [Q[state, np.argmax(policy[state])] for state in env.get_state_space]
  )

In [None]:
# TRAIN THE AGENT WITH A MONTE CARLO GRADIENT UPDATE
alpha = 0.01                                              # Learning rate
for _ in range(1000):
  state = np.random.choice(env.get_state_space)           # Random initial state
  while state != 3:                                       # Goal state
    # SELECT AN ACTION ACCORDING TO THE CURRENT POLICY
    action = np.random.choice(
      env.get_action_space,
      p=policy[state]
    )
    next_state = state + action                           # Identify next state
    reward = env.get_rewards[next_state]                  # Obtain reward
    gradient = np.zeros_like(policy[state])               # One-hot vector for the chosen action
    gradient[action] = 1
    # SHIFT THE CHOSEN ACTION'S PROBABILITY BY THE REWARD RELATIVE TO THE BASELINE
    policy[state] += alpha * gradient * (reward - average_reward(Q))
    policy[state] = np.clip(policy[state], 0, 1)          # Clip policy values
    policy[state] = policy[state] / np.sum(policy[state]) # Policy normalization
    state = next_state
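
The loop above adjusts the action probabilities directly and then clips and renormalizes them. A common alternative, sketched here as an assumption and not as the notebook's method, is REINFORCE with a softmax policy: the parameters are unconstrained action preferences, and each update follows the discounted return multiplied by the gradient of the log-probability of the action taken. The names `theta`, `softmax`, `lr`, and `policy_pg`, the fixed seed, and the 200-step cap per episode are all choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)           # Fixed seed so the run is repeatable
rewards = {0: -1, 1: -1, 2: -1, 3: 10}   # Same rewards as the environment
gamma, lr, goal = 0.9, 0.05, 3           # Discount, step size (assumed), terminal state
theta = np.zeros((4, 2))                 # Action preferences, one row per state

def softmax(logits):
    """Turn action preferences into probabilities (numerically stable)."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

for episode in range(5000):
    state = int(rng.integers(0, goal))   # Random non-terminal start state
    trajectory = []
    for t in range(200):                 # Safety cap on episode length
        action = int(rng.choice(2, p=softmax(theta[state])))
        next_state = state + action
        trajectory.append((state, action, rewards[next_state]))
        state = next_state
        if state == goal:
            break

    # Accumulate return * grad log pi(a|s) over the episode, then apply once
    grad = np.zeros_like(theta)
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + gamma * G                # Discounted return from this step onward
        grad_log = -softmax(theta[s])    # d log pi(a|s) / d theta[s] = onehot(a) - pi(.|s)
        grad_log[a] += 1.0
        grad[s] += G * grad_log
    theta += lr * grad                   # Gradient ascent step

policy_pg = np.array([softmax(row) for row in theta])
print(np.round(policy_pg, 3))
```

Because the softmax always yields valid probabilities, no clipping or renormalization is needed. In this sketch the probability of action 1 should end up above 0.5 in every non-terminal state, in line with the greedy policy found by Q-Learning.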