# Q-Learning on FrozenLake Environment (with Live Q-Table Visualization)

This notebook applies the Q-learning algorithm to the FrozenLake-v1 environment from OpenAI Gym.  
We train the agent, visualize the Q-table as it changes during training, and then test the learned policy with visual rendering.


In [None]:
import time  # for delay during plot updates
import gym  # OpenAI Gym; the code below uses the gym>=0.26 API (reset() returns (obs, info), step() returns 5 values)
import matplotlib.pyplot as plt  # for Q-table heatmap visualization
import numpy as np  # numerical computations


## Environment Setup

We define two environments:
- One for training (no rendering)
- One for testing (with human-readable visualization)
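
To make the state and action numbering concrete, here is a minimal pure-Python sketch of the non-slippery 4x4 grid dynamics. It assumes the default 4x4 map layout and the action encoding used in the heatmap labels (0: Left, 1: Down, 2: Right, 3: Up). `GRID`, `MOVES`, and `move` are hypothetical names for illustration and are not part of Gym.

```python
# Assumed default 4x4 layout: S = start, F = frozen, H = hole, G = goal.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
SIZE = 4
# Action index -> (row change, column change).
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def move(state, action):
    """Return (next_state, reward, terminated) for one non-slippery step."""
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[action]
    # A move off the edge of the grid leaves the agent in place.
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    tile = GRID[row][col]
    next_state = row * SIZE + col
    reward = 1.0 if tile == "G" else 0.0
    return next_state, reward, tile in "HG"

print(move(0, 2))   # from the start tile, move Right
print(move(14, 2))  # from state 14, step onto the goal
```

States are numbered row by row, so state 0 is the top-left start tile and state 15 is the bottom-right goal. These indices are the rows of the Q-table built in the next section.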


In [None]:
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
test_env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="human")


## Q-Table Initialization and Hyperparameters

We initialize the Q-table with zeros and define the learning rate, discount factor, exploration rate, and total episodes.
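
As a rough illustration of how `alpha` and `gamma` enter a single update, the sketch below applies the same update formula as the training loop to made-up numbers. The function name `td_update` and the sample values are assumptions for illustration only.

```python
def td_update(q_sa, reward, max_next_q, alpha=0.3, gamma=0.99):
    """One Q-learning update for a single state-action value."""
    target = reward + gamma * max_next_q   # bootstrapped estimate of the return
    return q_sa + alpha * (target - q_sa)  # move a fraction alpha toward it

# Stepping onto the goal: reward 1, terminal next state (all Q-values still 0).
q_goal = td_update(q_sa=0.0, reward=1.0, max_next_q=0.0)
# One step earlier: no reward yet, but the next state now has value q_goal.
q_prev = td_update(q_sa=0.0, reward=0.0, max_next_q=q_goal)
print(round(q_goal, 4), round(q_prev, 4))
```

A larger `alpha` moves each estimate further toward its target per visit, while `gamma` controls how much of the goal reward propagates back to earlier states.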


In [None]:
q_table = np.zeros((env.observation_space.n, env.action_space.n))

alpha = 0.3      # learning rate
gamma = 0.99     # discount factor
epsilon = 0.5    # exploration rate
episodes = 3000  # total training episodes


## Q-Table Heatmap Setup

We draw the Q-table as a matplotlib heatmap and refresh it periodically during training, so the agent's learning progress is visible in real time.


In [None]:
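plt.ion()  # interactive mode, so the figure can be redrawn during training
fig, ax = plt.subplots()
# Rewards are 0 or 1 and gamma < 1, so Q-values stay within [0, 1]. Fix the color
# scale up front: it would otherwise be set from the all-zero table, and
# set_data() does not rescale it.
img = ax.imshow(q_table, cmap='coolwarm', interpolation='nearest', vmin=0, vmax=1)
plt.colorbar(img)
ax.set_title("Q-Table Heatmap (State x Action)")
ax.set_xlabel("Actions (0:Left, 1:Down, 2:Right, 3:Up)")
ax.set_ylabel("States (0-15)")

def update_plot():
    img.set_data(q_table)      # push the current Q-values into the image
    fig.canvas.draw()
    fig.canvas.flush_events()
    time.sleep(0.01)           # brief pause so the redraw is visible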


## Q-Learning Training Loop

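The agent follows an epsilon-greedy policy: with probability `epsilon` it takes a random action (exploration), and otherwise it takes the action with the highest Q-value in the current state (exploitation). After every step it updates the Q-value of the state-action pair it just tried.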
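
Written out, the update applied in the training loop is the standard tabular Q-learning rule. The symbols $s$, $a$, $r$, and $s'$ are notation introduced here for `state`, `action`, `reward`, and `next_state`, while $\alpha$ and $\gamma$ are `alpha` and `gamma`:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

The bracketed term is the gap between the bootstrapped target $r + \gamma \max_{a'} Q(s', a')$ and the current estimate. Each update closes a fraction $\alpha$ of that gap.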


In [None]:
for episode in range(episodes):
    state = env.reset()[0]
    done = False

    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()  # exploration
        else:
            action = np.argmax(q_table[state])  # exploitation

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # end on goal/hole or when the episode is cut off

        q_table[state, action] = q_table[state, action] + alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )

        state = next_state

    if episode % 100 == 0:
        update_plot()


## Final Q-Table Plot

Let's show the final state of the Q-table after all episodes.


In [None]:
update_plot()
plt.ioff()
plt.show()


## Testing the Learned Policy

We now run the greedy policy (always the highest-valued action) in an identically configured environment with rendering turned on.
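
As a sketch of what "learned policy" means here, the snippet below reads the greedy action out of each row of a small Q-table. The values in `sample_q` are hypothetical and not taken from a training run, and `greedy_policy` is an illustrative helper rather than anything provided by Gym or NumPy.

```python
ARROWS = ["<", "v", ">", "^"]  # 0: Left, 1: Down, 2: Right, 3: Up

def greedy_policy(q_rows):
    """Pick the highest-valued action in each state (first one wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_rows]

# Hypothetical Q-values for three states.
sample_q = [
    [0.10, 0.90, 0.20, 0.00],  # Down is best
    [0.00, 0.00, 0.70, 0.10],  # Right is best
    [0.00, 0.00, 0.00, 0.00],  # untrained row: the tie resolves to Left
]
policy = greedy_policy(sample_q)
print("".join(ARROWS[a] for a in policy))
```

The tie-breaking mirrors `np.argmax`, which also returns the first maximal index. A state whose row is still all zeros therefore always maps to action 0 (Left).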


In [None]:
state = test_env.reset()[0]
test_env.render()

for _ in range(20):
    action = np.argmax(q_table[state])
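    next_state, reward, terminated, truncated, _ = test_env.step(action)
    test_env.render()
    state = next_state
    if terminated or truncated:  # goal or hole reached, or the episode was cut off
        break

test_env.close()  # release the render window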
