In [1]:
import numpy as np

# Define the environment
num_states = 5
num_actions = 2
gamma = 0.9  # discount factor

# Initialize the Q-table with zeros
Q = np.zeros((num_states, num_actions))

# Define the reward matrix
rewards = np.array([
    [-1, 0],
    [-1, -1],
    [0, -1],
    [-1, 1],
    [0, 0]
])

# Define the transition matrix
transitions = np.array([
    [0, 1],
    [0, 2],
    [1, 3],
    [2, 4],
    [3, 4]
])

# Training the Q-learning algorithm
num_episodes = 1000

for episode in range(num_episodes):
    state = np.random.randint(0, num_states)  # Randomly initialize the starting state

    while True:
        # Epsilon-greedy: exploit the best known action 90% of the time, explore otherwise
        action = np.argmax(Q[state, :]) if np.random.rand() < 0.9 else np.random.randint(0, num_actions)

        next_state = transitions[state, action]
        reward = rewards[state, action]

        # Bellman update with an implicit learning rate of 1
        # (sufficient here because transitions and rewards are deterministic)
        Q[state, action] = reward + gamma * np.max(Q[next_state, :])

        state = next_state

        if state == 4:  # Terminal state
            break

# Testing the learned policy
current_state = 0
path = [current_state]

while current_state != 4:
    action = np.argmax(Q[current_state, :])
    current_state = transitions[current_state, action]
    path.append(current_state)

print("Learned Q-values:")
print(Q)
print("Optimal path:")
print(path)


Learned Q-values:
[[0.91415789 2.12684211]
 [0.91415789 2.36315789]
 [2.12684211 3.73684211]
 [2.36315789 5.26315789]
 [4.73684211 4.26315789]]
Optimal path:
[0, 1, 2, 3, 4]
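
As a rough sanity check on the printed Q-values above, the same table can be computed without any exploration by repeatedly applying the Bellman optimality backup to every state-action pair. This is a sketch and not part of the original notebook: the function name `bellman_fixed_point`, the tolerance `tol`, and the sweep limit `max_sweeps` are illustrative assumptions. It treats state 4 like any other state, which mirrors the training loop, where an episode may start in state 4 and update that row before the termination check.

```python
import numpy as np

gamma = 0.9
rewards = np.array([[-1, 0], [-1, -1], [0, -1], [-1, 1], [0, 0]])
transitions = np.array([[0, 1], [0, 2], [1, 3], [2, 4], [3, 4]])

def bellman_fixed_point(rewards, transitions, gamma, tol=1e-10, max_sweeps=10_000):
    """Iterate Q <- r + gamma * max_a' Q[s', a'] until the table stops changing.

    Hypothetical helper for illustration; assumes deterministic transitions.
    """
    Q = np.zeros(rewards.shape, dtype=float)
    for _ in range(max_sweeps):
        V = Q.max(axis=1)                          # best value available in each state
        Q_new = rewards + gamma * V[transitions]   # V[transitions] has shape (states, actions)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

Q_star = bellman_fixed_point(rewards, transitions, gamma)
print(np.round(Q_star, 4))
```

Under these assumptions the result should agree with the printed table to the displayed precision. For example, `Q_star[3, 1]` works out to 1 / (1 - 0.9²) ≈ 5.263, because the best continuation from state 4 is to step back to state 3 and collect the reward of 1 again.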


Here is a step-by-step breakdown of the Q-learning algorithm implemented in cell `[1]`:

1. **Environment Definition:**
   - Define the number of states (`num_states`) and actions (`num_actions`) in the environment.
   - Set the discount factor (`gamma`), which determines the importance of future rewards.

2. **Q-table Initialization:**
   - Initialize a Q-table (`Q`) with zeros.

3. **Environment Dynamics Definition:**
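   - Define the reward matrix (`rewards`) and the transition matrix (`transitions`), both indexed by `[state, action]`.

4. **Training the Q-learning Algorithm:**
   - Specify the number of training episodes (`num_episodes`).
   - For each episode, start from a randomly chosen state (`np.random.randint(0, num_states)`).
   - Repeat until the terminal state (`state == 4`) is reached:
     - Choose an action: the greedy one with probability 0.9, a random one otherwise.
     - Take the chosen action and observe the next state and reward.
     - Update the Q-value for the current state-action pair using the Bellman equation.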

5. **Testing the Learned Policy:**
   - Initialize the current state for testing (`current_state = 0`).
   - While not in the terminal state (`current_state != 4`):
     - Choose the action with the maximum Q-value for the current state.
     - Take the chosen action and move to the next state.
     - Record the path of states.
   - Print the learned Q-values and the optimal path.
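
The training step in item 4 can also be written as two small helpers. This is a minimal sketch under assumptions that are not in the original notebook: the names `epsilon_greedy` and `q_update`, the `epsilon` and `alpha` parameters, and the use of `np.random.default_rng` are all illustrative. The notebook exploits with probability 0.9, which corresponds to `epsilon=0.1` here. With `alpha=1.0` the update reduces to the direct assignment used in the notebook; smaller values are the usual choice when rewards or transitions are noisy.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def q_update(Q, state, action, reward, next_state, gamma, alpha=1.0):
    """Move Q[state, action] toward the one-step Bellman target, in place."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
    return Q[state, action]

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
Q = np.zeros((5, 2))

a = epsilon_greedy(Q, state=3, epsilon=0.1, rng=rng)

# One update for taking action 1 in state 3 (reward 1, next state 4), with alpha=0.5
q_update(Q, state=3, action=1, reward=1, next_state=4, gamma=0.9, alpha=0.5)
print(Q[3])
```

Starting from an all-zero table, the target for this single update is 1, so `alpha=0.5` moves `Q[3, 1]` halfway there.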