# Q-Learning By Example



## A Simple Example

This environment consists of 6 states, labeled S0 through S5, arranged linearly:

```text
S0 — S1 — S2 — S3 — S4 — S5
```

- S2 is the start state: every episode begins there.
- S0 and S5 are terminal states: the episode ends as soon as the agent reaches either one.
- Rewards:
    - Reaching S0 gives a reward of -1.
    - Reaching S5 gives a reward of +2.
    - Every other transition gives a reward of 0.

**Actions**

In each non-terminal state (S1 to S4), the agent has two possible actions:

- Action a1: move left, to the next lower-numbered state.
  _Example_: from S2, action a1 moves the agent to S1.
- Action a2: move right, to the next higher-numbered state.
  _Example_: from S2, action a2 moves the agent to S3.

In the code, a1 is action index 0 and a2 is action index 1.


In [11]:
import numpy as np

In [12]:
rewards = [-1, 0, 0, 0, 0, 2]
q_values = [[0.0, 0.0],
            [0.0, 0.0],
            [0.0, 0.0],
            [0.0, 0.0],
            [0.0, 0.0],
            [0.0, 0.0]]
end_states = [True, False, False, False, False, True]

In [13]:
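def epsilon_greedy_action(state_index: int, epsilon: float = 0.1) -> int:
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if np.random.uniform() < epsilon:
        return int(np.random.randint(0, 2))  # Explore: 0 (left) or 1 (right)
    # Exploit: highest Q-value; np.argmax returns the first index (left) on ties
    return int(np.argmax(q_values[state_index]))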
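
Because every Q-value starts at 0.0, `np.argmax` resolves the initial ties by always returning index 0 (move left). With a large epsilon this hardly matters, but a variant that breaks ties at random is sometimes preferred. The sketch below is a hypothetical alternative, not part of the original notebook: `greedy_with_random_ties` is an illustrative name, and it takes one row of the Q-table as an argument instead of reading the global `q_values`.

```python
import numpy as np

def greedy_with_random_ties(q_row, rng=None):
    """Return the index of a maximal entry of q_row, chosen uniformly among ties."""
    rng = np.random.default_rng() if rng is None else rng
    q_row = np.asarray(q_row, dtype=float)
    best = np.flatnonzero(q_row == q_row.max())  # indices of all maximal entries
    return int(rng.choice(best))

print(greedy_with_random_ties([0.0, 2.0]))  # the unique maximum is at index 1
```

Passing a seeded `np.random.default_rng(seed)` as `rng` makes the tie-breaking reproducible.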

In [14]:
def take_action(current_state_index: int, action: int) -> tuple[int, int]:
    """Apply an action (0 = left, 1 = right) and return (reward, next_state_index)."""
    if action == 0:  # Move left
        next_state_index = current_state_index - 1
    else:  # Move right
        next_state_index = current_state_index + 1
    return rewards[next_state_index], next_state_index

In [15]:
NUM_EPISODES = 1000
GAMMA = 0.9
INITIAL_STATE_INDEX = 2
epsilon = 0.9
for episode in range(NUM_EPISODES):
    state_index = INITIAL_STATE_INDEX

    while not end_states[state_index]:
        action = epsilon_greedy_action(state_index, epsilon)
        reward, next_state_index = take_action(state_index, action)

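        if end_states[next_state_index]:
            # Terminal transition: there is no future value to add
            q_values[state_index][action] = reward
        else:
            # Bootstrap from the best Q-value of the next state
            q_values[state_index][action] = reward + GAMMA * np.max(q_values[next_state_index])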

        state_index = next_state_index

    epsilon = max(0.0, epsilon - 1 / NUM_EPISODES)  # Linear decay, clamped so epsilon never goes negative
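
The assignment inside the loop is the Q-learning update with the learning rate fixed at 1, which is a reasonable simplification here because transitions and rewards are deterministic. Written out, with $\gamma$ standing for `GAMMA` (the symbols $s$, $a$, $r$, $s'$ and $\alpha$ are notation introduced for this note, not names from the code):

$$
Q(s, a) \leftarrow
\begin{cases}
r & \text{if } s' \text{ is terminal} \\
r + \gamma \max_{a'} Q(s', a') & \text{otherwise}
\end{cases}
$$

This is the special case $\alpha = 1$ of the general rule $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$.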

In [16]:
q_values

[[0.0, 0.0],
 [-1, np.float64(1.4580000000000002)],
 [np.float64(1.3122000000000003), np.float64(1.62)],
 [np.float64(1.4580000000000002), np.float64(1.8)],
 [np.float64(1.62), 2],
 [0.0, 0.0]]
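
These values can be cross-checked without any sampling. Since the environment is deterministic, sweeping the same update repeatedly over every non-terminal state and action should converge to the same table. The sketch below is one such check under the notebook's settings (`GAMMA = 0.9`, the same rewards and terminal states); `bellman_sweeps` and its `num_sweeps` parameter are hypothetical names used only for illustration.

```python
import numpy as np

REWARDS = [-1, 0, 0, 0, 0, 2]
END_STATES = [True, False, False, False, False, True]
GAMMA = 0.9

def bellman_sweeps(num_sweeps: int = 10) -> np.ndarray:
    """Apply the deterministic Q update to every (state, action) pair repeatedly."""
    q = np.zeros((len(REWARDS), 2))
    for _ in range(num_sweeps):
        for state in range(len(REWARDS)):
            if END_STATES[state]:
                continue  # terminal states keep Q = 0
            for action, step in enumerate((-1, +1)):  # 0 = left, 1 = right
                next_state = state + step
                future = 0.0 if END_STATES[next_state] else GAMMA * q[next_state].max()
                q[state, action] = REWARDS[next_state] + future
    return q

q_table = bellman_sweeps()
print(np.round(q_table, 4))
```

The right-hand column follows the pattern 2 × 0.9^k, where k is the number of steps still needed after the move: 2 from S4, 1.8 from S3, 1.62 from S2, and 1.458 from S1.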

In [22]:
state_index = INITIAL_STATE_INDEX
epsilon = 0
while not end_states[state_index]:
    action = epsilon_greedy_action(state_index, epsilon)
    reward, next_state_index = take_action(state_index, action)
    print(f"Taking action {action} in state {state_index}, received reward {reward}, moving to state {next_state_index}")
    state_index = next_state_index


Taking action 1 in state 2, received reward 0, moving to state 3
Taking action 1 in state 3, received reward 0, moving to state 4
Taking action 1 in state 4, received reward 2, moving to state 5
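
The rollout above follows the greedy action from the start state only. To read off the greedy policy for every non-terminal state at once, one option is to take the arg-max of each row of the table. A minimal sketch, with the learned values copied from the `q_values` output and `ACTION_NAMES` as a hypothetical label mapping:

```python
import numpy as np

# Learned Q-table, copied from the q_values output (rounded)
learned_q = np.array([[0.0, 0.0], [-1.0, 1.458], [1.3122, 1.62],
                      [1.458, 1.8], [1.62, 2.0], [0.0, 0.0]])
END_STATES = [True, False, False, False, False, True]
ACTION_NAMES = {0: "left", 1: "right"}  # hypothetical labels for a1 and a2

# Greedy action for each non-terminal state
policy = {
    f"S{state}": ACTION_NAMES[int(np.argmax(row))]
    for state, row in enumerate(learned_q)
    if not END_STATES[state]
}
print(policy)  # {'S1': 'right', 'S2': 'right', 'S3': 'right', 'S4': 'right'}
```

Every non-terminal state prefers moving right, including S1, where the -1 reward at S0 is only one step away.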
