# <FONT COLOR="red">***EXAMPLE - SIMPLE GRIDWORLD AGENT***</FONT>
---
---
In this demonstration, I implement the Q-Learning algorithm in a simple gridworld. A gridworld is a grid in which an agent moves between cells, with the objective of reaching a target cell while avoiding obstacles. Implementing Q-Learning shows how the agent learns to choose actions from the rewards it receives for each state-action pair.

## <FONT COLOR="orange">**Exercise 1: Implementation of Q-Learning in a Simple Gridworld environment**</FONT>
---
---
The first step is to import the libraries. For this simplest version of Reinforcement Learning (RL) in Google Colab, NumPy is enough to store and manipulate the data.

In [1]:
# IMPORT COMMON LIBRARIES
import numpy as np

In [2]:
# GRIDWORLD DEFINITION
gridworld = np.array([
  [-1,-1,-1,1],
  [-1,-1,-1,-1],
  [-1,-1,-1,-1],
  [-1,-1,-1,-1],
])

In [3]:
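# DEFINITION OF THE POSSIBLE ACTIONS
# Each action is a (row offset, column offset) pair, matching the (row, col) states used in the train loop
actions = [
  (0,-1), # Left
  (0,1), # Right
  (-1,0), # Up
  (1,0) # Down
]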
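
To check what each offset does, here is a minimal sketch that applies every action to one cell. It assumes states are `(row, col)` tuples on the 4x4 grid, as the train loop treats them; `step` is a hypothetical helper name that does not appear in the notebook.

```python
actions = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # (row offset, column offset)
shape = (4, 4)                                # same size as the gridworld

def step(state, action_index):
    """Hypothetical helper: apply an action, staying put if it would leave the grid."""
    row = state[0] + actions[action_index][0]
    col = state[1] + actions[action_index][1]
    if 0 <= row < shape[0] and 0 <= col < shape[1]:
        return (row, col)
    return state

# Apply all four actions from the cell in row 1, column 1
moves = [step((1, 1), i) for i in range(len(actions))]
print(moves)  # [(1, 0), (1, 2), (0, 1), (2, 1)]
```

The first two actions change only the column and the last two change only the row. A move that would leave the grid leaves the state unchanged, which mirrors the bounds check in the train loop.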

In [8]:
# Q-LEARNING IMPLEMENTATION
Q = {}
for row in range(gridworld.shape[0]):
  for col in range(gridworld.shape[1]):
    Q[(row, col)] = [0] * len(actions)

In [5]:
# TRAIN PARAMETERS
gamma = 0.8
alpha = 0.1
num_episodes = 1000
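
For reference, the two parameters enter the tabular Q-Learning update used in the train loop. Here $\gamma$ is the discount factor (`gamma`), $\alpha$ is the learning rate (`alpha`), $r$ is the reward of the cell the agent moves into, and $s'$ is the new state:

$$
Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[\,r + \gamma \max_{a'} Q(s',a')\right]
$$

With $\alpha = 0.1$, each update moves the stored value 10% of the way towards the new estimate $r + \gamma \max_{a'} Q(s',a')$.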

In [9]:
# TRAIN LOOP
for _ in range(num_episodes):
  # Initial cell
  state = (0,0)
  while state != (0,3): # Target cell
    action = np.random.choice(range(len(actions)))
    new_row = state[0] + actions[action][0]
    new_col = state[1] + actions[action][1]
    if 0 <= new_row < gridworld.shape[0] and 0 <= new_col < gridworld.shape[1]:
      reward = gridworld[new_row,new_col]
      new_value = reward + gamma * np.max(Q[new_row,new_col])
      # Q-Learning Equation
      Q[state][action] = (1-alpha) * Q[state][action] + alpha * new_value
      # New State
      state = (new_row, new_col)
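
A short worked example of a single update may help in reading the loop. This is a sketch only: `q_update` is a hypothetical helper name, and the scenario assumed is the step from cell (0, 2) into the target (0, 3), where the reward is 1 and the target's Q-values stay at 0 because no update is ever made from the target.

```python
def q_update(q_old, reward, max_q_next, alpha=0.1, gamma=0.8):
    """One tabular Q-learning update (hypothetical helper, same formula as the loop)."""
    target = reward + gamma * max_q_next
    return (1 - alpha) * q_old + alpha * target

# First update: the old value is 0, so the result is alpha * reward
q = q_update(0.0, 1, 0.0)
print(q)  # 0.1

# Repeating the same update moves the value towards the target of 1
for _ in range(200):
    q = q_update(q, 1, 0.0)
```

After many repetitions the value approaches 1 without being stored exactly, which is consistent with entries such as `0.9999999999999994` in the printed Q-table.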

In [10]:
# Q-VALUES AFTER TRAINING
print('Q-Values after training:')
print(Q)

Q-Values after training:
{(0, 0): [0, -0.9999999999999994, 0, -0.9999999999999994], (0, 1): [-0.9999999999999994, -0.20000000000000062, 0, -1.7999999999999987], (0, 2): [-0.9999999999999994, 0.9999999999999994, 0, -1.1600000000000024], (0, 3): [0, 0, 0, 0], (1, 0): [0, -1.7999999999999987, -0.9999999999999994, -0.9999999999999994], (1, 1): [-0.9999999999999994, -1.1600000000000024, -0.9999999999999994, -1.7999999999999987], (1, 2): [-1.7999999999999987, -0.20000000000000062, -0.20000000000000062, -1.7999999999999987], (1, 3): [-1.1600000000000024, 0, 0.9999999999999994, -0.9999999999999994], (2, 0): [0, -1.7999999999999987, -0.9999999999999994, -0.9999999999999994], (2, 1): [-0.9999999999999994, -1.7999999999999987, -1.7999999999999987, -0.9999999999999994], (2, 2): [-1.7999999999999987, -0.9999999999999994, -1.1600000000000024, -0.9999999999999994], (2, 3): [-1.7999999999999987, 0, -0.20000000000000062, -0.9999999999999994], (3, 0): [0, -0.9999999999999994, -0.9999999999999994, 0], (3
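
One way to read such a table is to follow the best action from each cell. The sketch below uses rounded Q-values for three cells taken from the output above and assumes that moves leaving the grid should be skipped: those entries are never updated, so they stay at 0 and would otherwise beat the negative values. `greedy_action` is a hypothetical helper name.

```python
import numpy as np

actions = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # (row offset, column offset)
grid_shape = (4, 4)

# Rounded Q-values for three cells, taken from the printed output
Q = {
    (0, 0): [0.0, -1.0, 0.0, -1.0],
    (0, 1): [-1.0, -0.2, 0.0, -1.8],
    (0, 2): [-1.0, 1.0, 0.0, -1.16],
}

def greedy_action(state):
    """Hypothetical helper: index of the best action that stays inside the grid."""
    best_index, best_value = None, -np.inf
    for index, (d_row, d_col) in enumerate(actions):
        row, col = state[0] + d_row, state[1] + d_col
        if 0 <= row < grid_shape[0] and 0 <= col < grid_shape[1]:
            if Q[state][index] > best_value:
                best_index, best_value = index, Q[state][index]
    return best_index

# Follow the greedy action until reaching a cell that is not in this partial table
state = (0, 0)
path = [state]
while state in Q:
    d_row, d_col = actions[greedy_action(state)]
    state = (state[0] + d_row, state[1] + d_col)
    path.append(state)

print(path)  # [(0, 0), (0, 1), (0, 2), (0, 3)]
```

Ties are broken in favour of the first action found, so other equally good paths may exist.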

## <FONT COLOR="orange">**Exercise 2: Reinforcement Learning Application in Games**</FONT>
---
---

In [11]:
# IMPORT COMMON LIBRARIES
import numpy as np

In [12]:
# REWARD DEFINITION
reward = {
  'win': 1,
  'lose': -1,
  'draw': 0
}

In [17]:
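# IMPLEMENTATION OF THE Q-LEARNING
Q = {}
gamma = 0.8 # Discount factor
alpha = 0.1 # Learning rate (same value as in Exercise 1; q_learning_game needs it)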

In [22]:
# DEFINITION OF THE Q-LEARNING ALGORITHM
def q_learning_game(actual_state, action, new_state, result):
    # Convert actions to numerical indices
    actions = ['paper', 'rock', 'scissors']  # Define all possible actions
    action_index = actions.index(action)  # Get the numerical index of the current action

    if actual_state not in Q:
        Q[actual_state] = np.zeros(len(actions))
    if new_state not in Q:
        Q[new_state] = np.zeros(len(actions))

    # Use action_index for indexing
    new_value = reward[result] + gamma * np.max(Q[new_state])
    Q[actual_state][action_index] = (1 - alpha) * Q[actual_state][action_index] + alpha * new_value

In [23]:
# States are represented as the opponent's previous move (e.g., 'rock', 'paper', 'scissors')
# Actions are represented as the agent's move (e.g., 'rock', 'paper', 'scissors')
# Result is either 'win', 'lose', or 'draw'

# Example usage:
actual_state = 'rock'  # Opponent's previous move was 'rock'
action = 'paper'  # Agent chooses to play 'paper'
new_state = 'paper'  # Opponent's next move (unknown, could be anything)
result = 'win'  # Agent wins this round

q_learning_game(actual_state, action, new_state, result)
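
The single call above can be extended to many rounds. The following is a minimal, self-contained sketch and not part of the original exercise: it assumes a uniformly random opponent, and `judge`, `update`, `beats`, and `moves` are hypothetical names introduced here. `update` applies the same rule as `q_learning_game`.

```python
import numpy as np

moves = ['paper', 'rock', 'scissors']
beats = {'paper': 'rock', 'rock': 'scissors', 'scissors': 'paper'}  # key beats value
reward = {'win': 1, 'lose': -1, 'draw': 0}
alpha, gamma = 0.1, 0.8
Q = {}

def judge(agent_move, opponent_move):
    """Hypothetical helper: outcome of one round from the agent's point of view."""
    if agent_move == opponent_move:
        return 'draw'
    return 'win' if beats[agent_move] == opponent_move else 'lose'

def update(state, action, new_state, result):
    """Same update rule as q_learning_game, repeated here so the sketch runs alone."""
    for s in (state, new_state):
        if s not in Q:
            Q[s] = np.zeros(len(moves))
    i = moves.index(action)
    target = reward[result] + gamma * np.max(Q[new_state])
    Q[state][i] = (1 - alpha) * Q[state][i] + alpha * target

rng = np.random.default_rng(0)  # seeded so runs are repeatable
state = 'rock'                  # opponent's previous move
for _ in range(1000):
    action = moves[rng.integers(len(moves))]         # agent explores at random
    opponent_move = moves[rng.integers(len(moves))]  # assumed random opponent
    update(state, action, opponent_move, judge(action, opponent_move))
    state = opponent_move                            # next state: opponent's last move

print(Q)
```

Against a random opponent no move has a real advantage, so the learned values should not be read as a strategy. The sketch only shows how states, actions, and results flow through the update.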

## <FONT COLOR="orange">**Exercise 3: Reinforcement Learning Application in Robotics**</FONT>
---
---

In [24]:
# IMPORT COMMON LIBRARIES
import numpy as np

In [25]:
# DEFINITION OF THE NAVIGATION ENVIRONMENT
environment = np.array([
  [0,0,0,0,0],
  [0,-1,-1,-1,0],
  [0,0,-1,0,0],
  [0,-1,-1,-1,0],
  [0,0,0,0,0]
])

In [26]:
# DEFINITION OF THE POSSIBLE ACTIONS
actions = [
  (0,-1), # Left (column - 1)
  (0,1), # Right (column + 1)
  (-1,0), # Up (row - 1)
  (1,0) # Down (row + 1)
]

In [27]:
# Q-LEARNING IMPLEMENTATION
Q = np.zeros((environment.shape[0], environment.shape[1], len(actions)))

In [28]:
# TRAIN PARAMETERS
gamma = 0.9 # Discount factor
alpha = 0.1 # Learning rate
num_episodes = 1000

In [31]:
# TRAIN LOOP
# Assumed cap on steps per episode: the environment defines no terminal cell,
# and stopping after the first valid move would leave every Q-value at 0
max_steps = 50
for _ in range(num_episodes):
  # Initial State
  state = (0,0)
  for step in range(max_steps):
    action = np.random.choice(range(len(actions)))
    new_row = state[0] + actions[action][0]
    new_col = state[1] + actions[action][1]
    # Moves that would leave the grid are ignored
    if 0 <= new_row < environment.shape[0] and 0 <= new_col < environment.shape[1]:
      reward = environment[new_row,new_col]
      new_value = reward + gamma * np.max(Q[new_row,new_col])
      # Q-Learning Equation
      Q[state[0], state[1], action] = (1 - alpha) * Q[state[0], state[1], action] + alpha * new_value
      # New State
      state = (new_row, new_col)

In [32]:
# Q-VALUES AFTER TRAINING
print('Q-Values after training:')
print(Q)

Q-Values after training:
[[[ 0.  0.  0.  0.]
  [ 0.  0.  0. -1.]
  [ 0.  0.  0. -1.]
  [ 0.  0.  0. -1.]
  [ 0.  0.  0.  0.]]

 [[ 0. -1.  0.  0.]
  [ 0. -1.  0.  0.]
  [-1. -1.  0. -1.]
  [-1.  0.  0.  0.]
  [-1.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0. -1. -1. -1.]
  [ 0.  0. -1. -1.]
  [-1.  0. -1. -1.]
  [ 0.  0.  0.  0.]]

 [[ 0. -1.  0.  0.]
  [ 0. -1.  0.  0.]
  [-1. -1. -1.  0.]
  [-1.  0.  0.  0.]
  [-1.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0. -1.  0.]
  [ 0.  0. -1.  0.]
  [ 0.  0. -1.  0.]
  [ 0.  0.  0.  0.]]]
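
With a NumPy Q-table, the best action per cell can be read off with `np.argmax`. The sketch below uses only the five cells of row 1, copied from the printed table. For the full table the equivalent call would be `np.argmax(Q, axis=2)`.

```python
import numpy as np

# Q-values of the five cells in row 1, copied from the printed table.
# Shape: (columns, actions).
Q_row1 = np.array([
    [ 0., -1.,  0.,  0.],
    [ 0., -1.,  0.,  0.],
    [-1., -1.,  0., -1.],
    [-1.,  0.,  0.,  0.],
    [-1.,  0.,  0.,  0.],
])

best_action = np.argmax(Q_row1, axis=1)  # the first index wins on ties
best_value = np.max(Q_row1, axis=1)

print(best_action)  # [0 0 2 1 1]
print(best_value)   # [0. 0. 0. 0. 0.]
```

Because many entries tie at 0, including those for moves that would leave the grid and are never updated, the selected index mostly reflects tie-breaking. The table mainly tells the agent which moves to avoid, namely those with value -1 that enter an obstacle cell.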


## <FONT COLOR="orange">**Exercise 4: Reinforcement Learning Application to Resource Management**</FONT>
---
---

In [37]:
# IMPORT COMMON LIBRARIES
import numpy as np

In [38]:
# DEFINITION OF THE STATES
states = ['low','medium','high']

In [44]:
# DEFINITION OF THE ACTIONS
actions = ['Replenish','Do not replenish']

In [40]:
# REWARD DEFINITION
rewards = {
  ('low','Replenish'): 50,
  ('low','Do not replenish'): -10,
  ('medium','Replenish'): 30,
  ('medium','Do not replenish'): 0,
  ('high','Replenish'): 10,
  ('high','Do not replenish'): -20
}

In [41]:
# Q-LEARNING IMPLEMENTATION
Q = {}

In [42]:
# TRAIN PARAMETERS
gamma = 0.9 # Discount factor
alpha = 0.1 # Learning rate
num_episodes = 1000

In [45]:
# TRAIN LOOP
for _ in range(num_episodes):
  actual_state = np.random.choice(states)
  while True:
    action = np.random.choice(actions)
    reward = rewards[(actual_state, action)]
    if actual_state not in Q:
      Q[actual_state] = {}
    if action not in Q[actual_state]:
      Q[actual_state][action] = 0

    new_state = np.random.choice(states)
    max_new_state = max(Q[new_state].values()) if new_state in Q else 0
    Q[actual_state][action] += alpha * (reward + gamma * max_new_state - Q[actual_state][action])
    # New State
    actual_state = new_state
    # End the episode once the agent replenishes (rewards 50, 30 and 10 all belong to 'Replenish')
    if reward in (50, 30, 10):
      break

In [46]:
# Q-VALUES AFTER TRAINING
print('Q-Values after training:')
print(Q)

Q-Values after training:
{'low': {'Do not replenish': 243.55056025660724, 'Replenish': 295.51119427657306}, 'medium': {'Do not replenish': 251.72363992408145, 'Replenish': 284.0966048720111}, 'high': {'Replenish': 258.4381776620773, 'Do not replenish': 228.74590917475166}}
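
A policy can be derived from this nested dictionary by picking, for each state, the action with the highest Q-value. The sketch below uses the printed values rounded to two decimals; `policy` is a name introduced here for illustration.

```python
# Q-values from the output above, rounded to two decimals
Q = {
    'low': {'Do not replenish': 243.55, 'Replenish': 295.51},
    'medium': {'Do not replenish': 251.72, 'Replenish': 284.10},
    'high': {'Replenish': 258.44, 'Do not replenish': 228.75},
}

# For each state, keep the action whose Q-value is largest
policy = {state: max(q_values, key=q_values.get) for state, q_values in Q.items()}
print(policy)
# {'low': 'Replenish', 'medium': 'Replenish', 'high': 'Replenish'}
```

With this reward table, replenishing pays more than not replenishing in every state, so the learned policy always replenishes.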
