# Week 1: Q-Learning on FrozenLake

Welcome to Week 1! This week we’ll get hands-on with **Q-learning** using OpenAI Gym’s **FrozenLake-v1** environment. Your goal is to train a tabular Q-learning agent to navigate a slippery frozen lake without falling into holes.

Before starting, read about the FrozenLake environment in the Gymnasium documentation (Gymnasium is the maintained successor to OpenAI Gym), then try to solve this assignment.

## Goals

- Understand and implement Q-learning with a Q-table.
- Use an ε-greedy exploration strategy.
- Visualize training progress with reward curves.
- Evaluate the learned policy.
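For reference, here is the standard tabular Q-learning update, which the training loop below applies after every step. Here $\alpha$ is the learning rate, $\gamma$ the discount factor, $r$ the reward, and $s'$ the next state. When $s'$ is terminal (a hole or the goal), the $\max$ term is taken to be 0:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

The ε-greedy strategy picks the action for state $s$ as:

$$
a = \begin{cases} \text{a uniformly random action} & \text{with probability } \varepsilon, \\ \arg\max_{a'} Q(s, a') & \text{with probability } 1 - \varepsilon. \end{cases}
$$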


## Environment Setup

```python
%pip install numpy==1.23.5 gym==0.26.2 matplotlib
```


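```python
import gym
import numpy as np
import matplotlib.pyplot as plt

# is_slippery=True: each move goes in the intended direction with probability 1/3,
# otherwise to one of the two perpendicular directions.
# render_mode="ansi": env.render() returns the grid as a text string.
env = gym.make("FrozenLake-v1", is_slippery=True, render_mode="ansi")
state_space_size = env.observation_space.n   # 16 states on the default 4x4 map
action_space_size = env.action_space.n       # 4 actions: 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP
Q = np.zeros((state_space_size, action_space_size))  # one row of action values per state
```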
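The Q-table is easier to read once you know how FrozenLake encodes states and actions. States are numbered row-major (`state = row * ncol + col`), and actions are 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP. The helpers below are not part of the assignment, and their names (`to_state`, `greedy_policy_grid`, `ACTION_ARROWS`) are hypothetical. They are a minimal sketch that draws the greedy action for each cell as an arrow.

```python
import numpy as np

# FrozenLake action indices 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP, drawn as arrows (hypothetical helper)
ACTION_ARROWS = ["←", "↓", "→", "↑"]


def to_state(row, col, ncol=4):
    """Row-major state index of grid cell (row, col)."""
    return row * ncol + col


def greedy_policy_grid(Q, nrow=4, ncol=4):
    """Return one string per grid row, with argmax_a Q[s, a] drawn as an arrow.

    Ties go to the lowest action index (np.argmax behaviour), so all-zero rows show LEFT.
    """
    grid = []
    for row in range(nrow):
        arrows = [ACTION_ARROWS[int(np.argmax(Q[to_state(row, col, ncol)]))] for col in range(ncol)]
        grid.append("".join(arrows))
    return grid
```

After training, `print("\n".join(greedy_policy_grid(Q)))` shows the learned policy. On the 8x8 map, pass `nrow=8, ncol=8`. Arrows on holes and the goal carry no meaning because episodes end there.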

## Training Loop
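Complete the training loop: choose actions ε-greedily, apply the Q-learning update after every step, and decay ε between episodes.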
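Action selection and the exploration schedule can also be written as standalone functions. The sketch below uses hypothetical names `epsilon_greedy` and `epsilon_at`, and it swaps in a `numpy.random.Generator` in place of `np.random.rand()` and `env.action_space.sample()`. It differs from a plain `np.argmax` in one respect: ties are broken at random. This matters early in training, when every Q-value is 0 and `np.argmax` would always pick action 0 (LEFT). The schedule function reproduces the multiplicative decay with a 0.01 floor. With the defaults `epsilon=1.0, decay=0.995`, ε first hits the floor at episode 919 (0-based), because $0.995^{918} \approx 0.01004$ and $0.995^{919} \approx 0.00999$.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action for one state from its row of Q-values.

    With probability epsilon choose uniformly at random (explore); otherwise choose
    a greedy action, breaking ties uniformly at random instead of always taking index 0.
    """
    n_actions = len(q_row)
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    best = np.flatnonzero(q_row == np.max(q_row))  # indices of all maximal actions
    return int(rng.choice(best))


def epsilon_at(episode, start=1.0, decay=0.995, floor=0.01):
    """Epsilon used in a given (0-based) episode under multiplicative decay with a floor."""
    return max(start * decay ** episode, floor)
```

For example, `epsilon_greedy(Q[state], epsilon, np.random.default_rng(0))` could replace the `if`/`else` in the loop. Creating the generator once, outside the loop, keeps runs reproducible.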

```python
def train_agent(episodes=2000, alpha=0.8, gamma=0.95, epsilon=1.0, decay=0.995):
    """Train the global Q-table in place with tabular Q-learning; return per-episode rewards."""
    rewards = []
    for ep in range(episodes):
        state = env.reset()[0]  # reset() returns (observation, info)
        done = False
        total_reward = 0
        while not done:
            # Select an action with an epsilon-greedy policy
            if np.random.rand() < epsilon:
                action = env.action_space.sample()  # explore
            else:
                action = np.argmax(Q[state])        # exploit

            next_state, reward, terminated, truncated, _ = env.step(action)
            # End on a hole or the goal (terminated) or on the time limit (truncated)
            done = terminated or truncated

            # Q-learning update; a terminal next state has no future value
            target = reward + gamma * np.max(Q[next_state]) * (not terminated)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
            total_reward += reward

        # Decay epsilon between episodes, keeping a floor of 0.01
        epsilon = max(epsilon * decay, 0.01)

        rewards.append(total_reward)
    return rewards
```
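To see how reward propagates backward through the table, the single update step can be isolated into a function. The name `q_update` is hypothetical. It mirrors the update inside `train_agent`, including treating a terminal next state as having no future value. Take two illustrative transitions on the 4x4 map, starting from an all-zero table. In the first, moving RIGHT from state 14 into the goal (state 15) with reward 1 sets `Q[14, 2] = 0 + 0.8 × (1 − 0) = 0.8`. In the second, a later step RIGHT from state 13 into state 14 with reward 0 gives `Q[13, 2] = 0.8 × (0 + 0.95 × 0.8 − 0) = 0.608`. On the slippery lake, RIGHT only lands in the intended cell some of the time; these transitions are illustrative.

```python
import numpy as np


def q_update(Q, state, action, reward, next_state, terminated, alpha=0.8, gamma=0.95):
    """Apply one tabular Q-learning update to Q in place and return the TD error."""
    # A terminal next state (hole or goal) has no future value to bootstrap from
    future = 0.0 if terminated else gamma * np.max(Q[next_state])
    td_error = reward + future - Q[state, action]
    Q[state, action] += alpha * td_error
    return float(td_error)


Q_demo = np.zeros((16, 4))
q_update(Q_demo, 14, 2, 1.0, 15, True)    # RIGHT from 14 into the goal
q_update(Q_demo, 13, 2, 0.0, 14, False)   # RIGHT from 13 into 14
print(Q_demo[14, 2], Q_demo[13, 2])
```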

## Plotting Results

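```python
rewards = train_agent()
# Average reward over a sliding 100-episode window. Rewards are 1 at the goal and 0 otherwise,
# so this curve is the recent success rate.
plt.plot(np.convolve(rewards, np.ones(100)/100, mode='valid'))
plt.title("100-Episode Moving Average of Rewards")
plt.xlabel("Episode")
plt.ylabel("Average Reward")
plt.show()
```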
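With `mode='valid'`, `np.convolve` returns `len(rewards) - window + 1` points, so plotting it without an x-array labels the first point "episode 0" even though it averages episodes 1 to 100. The hypothetical helper below pairs each average with the episode its window ends on. It assumes `len(rewards) >= window`.

```python
import numpy as np


def smoothed_curve(rewards, window=100):
    """Moving average of per-episode rewards, paired with the episode each window ends on.

    mode="valid" yields len(rewards) - window + 1 points; point i averages episodes
    i+1 .. i+window (1-based), so it is placed at episode i + window.
    """
    rewards = np.asarray(rewards, dtype=float)
    averages = np.convolve(rewards, np.ones(window) / window, mode="valid")
    episodes = np.arange(window, len(rewards) + 1)
    return episodes, averages
```

Usage: `episodes, avg = smoothed_curve(rewards)` followed by `plt.plot(episodes, avg)` gives the same curve with an x-axis that matches real episode numbers.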

## Test the Learned Policy

```python
def test_agent(Q, episodes=5):
    """Run the greedy policy for a few episodes and print every step."""
    for ep in range(episodes):
        state = env.reset()[0]
        done = False
        total_reward = 0
        step = 0
        print(f"\n===== Episode {ep + 1} =====")
        while not done:
            action = np.argmax(Q[state])  # always exploit: no exploration at test time
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # stop at a hole, the goal, or the time limit

            print(f"Step {step}: State {state} -> Action {action} -> Next State {next_state}, Reward: {reward}")
            print(env.render())  # render() returns a string in 'ansi' mode
            state = next_state
            total_reward += reward
            step += 1

        outcome = "Success 🎉" if reward == 1.0 else "Failure 💥"
        print(f"Episode {ep + 1} ended in {step} steps with total reward: {total_reward} ({outcome})")
```


```python
test_agent(Q)
```
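Five rendered episodes give a feel for the policy, but on a slippery lake the outcome varies a lot from run to run. A quantitative check is to run the greedy policy many times without rendering and report the fraction of episodes that reach the goal. The sketch below uses a hypothetical function `evaluate_policy` and assumes the gym 0.26 API used above: `reset()` returns `(obs, info)` and `step()` returns a 5-tuple. Episodes cut off by the environment's time limit count as failures.

```python
import numpy as np


def evaluate_policy(env, Q, episodes=1000):
    """Estimate the greedy policy's success rate over many episodes, without rendering.

    Assumes reset() -> (obs, info) and step() -> (obs, reward, terminated, truncated, info).
    """
    successes = 0
    for _ in range(episodes):
        state, _ = env.reset()
        terminated = truncated = False
        while not (terminated or truncated):
            action = int(np.argmax(Q[state]))  # greedy, no exploration
            state, reward, terminated, truncated, _ = env.step(action)
        # FrozenLake only pays a positive reward on reaching the goal
        if terminated and reward > 0:
            successes += 1
    return successes / episodes
```

Usage: `print(f"Greedy success rate: {evaluate_policy(env, Q):.1%}")`. This makes it easier to compare settings from the challenges below, such as `is_slippery=False` versus `True`.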

## Challenges

1. Set `is_slippery=False` and compare performance.
2. Change the reward for falling into holes (by default a hole gives 0, the same as any other non-goal step).
3. Add a decaying learning rate `α = α0 / (1 + decay * t)`.
4. Visualize the Q-table as a heatmap (optional).
5. Think about how you might generalize this approach to solve a randomly generated lake without first training on that specific map. (Post your ideas on the WhatsApp group; we will host a competition if people are interested.)
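For challenge 3, the schedule `α = α0 / (1 + decay * t)` can be written as a one-line function. In this sketch, the name `decayed_alpha`, the choice of `t` as the episode index (counting environment steps is another option), and the default `decay=0.01` are all illustrative assumptions. With `alpha0=0.8` and `decay=0.01`, the learning rate halves to 0.4 by `t=100` and falls to 0.2 by `t=300`.

```python
def decayed_alpha(t, alpha0=0.8, decay=0.01):
    """Learning-rate schedule alpha_t = alpha0 / (1 + decay * t) from challenge 3."""
    return alpha0 / (1.0 + decay * t)


# Learning rate at a few points in training (t = episode index here)
for t in (0, 100, 300, 1000):
    print(t, round(decayed_alpha(t), 4))
```

One way to use it is to compute `alpha_t = decayed_alpha(ep, alpha0=alpha)` at the start of each episode in `train_agent` and use `alpha_t` in the update in place of the fixed `alpha`.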


## TLDR

Learn how to implement tabular Q-learning to solve a simple environment. Use exploration, value updates, and reward tracking to build intuition before moving to deep RL.

