# Frozen Lake Q-Learning

Q-learning is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for a given environment. It learns by iteratively updating estimates of the optimal action-value function $Q^*(s,a)$, which represents the expected cumulative future reward of taking action $a$ in state $s$ and then following the optimal policy thereafter.
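
To make the iterative update concrete, here is a minimal sketch of a single tabular Q-learning step in plain Python. The two-state table, its values, and the sampled transition are made up for illustration; only `alpha` and `gamma` match the hyperparameters used later in this notebook.

```python
# One tabular Q-learning update on a toy table (illustrative values only).
alpha = 0.1  # learning rate
gamma = 0.9  # discount factor

# Q[state][action]: hypothetical estimates for 2 states and 2 actions
Q = [
    [0.0, 0.0],
    [0.5, 1.0],
]

# A made-up transition: in state 0, action 1 gave reward 0 and led to state 1
state, action, reward, next_state = 0, 1, 0.0, 1

# TD target: immediate reward plus the discounted best value of the next state
td_target = reward + gamma * max(Q[next_state])

# Move the current estimate a fraction alpha toward the target
Q[state][action] += alpha * (td_target - Q[state][action])

print(Q[state][action])
```

The estimate for `(state, action)` moves only part of the way toward the target, which is what lets repeated updates average over noisy transitions.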
<hr>

Why did I choose the Frozen Lake environment?
- I chose Frozen Lake because tabular Q-learning needs a discrete observation space, so that every state can index a row of the Q-table. It does not apply directly to continuous observations.
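
As a rough illustration of what "discrete" means here, the sketch below maps grid cells of the default 4x4 Frozen Lake layout to the integer states the environment reports. The helper names `to_state` and `to_cell` are hypothetical and exist only for this example.

```python
# Default 4x4 FrozenLake layout: S = start, F = frozen, H = hole, G = goal
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_ROWS = len(LAKE_MAP)
N_COLS = len(LAKE_MAP[0])


def to_state(row, col):
    """Convert a grid cell to the integer state (row-major order)."""
    return row * N_COLS + col


def to_cell(state):
    """Convert an integer state back to its (row, col) grid cell."""
    return divmod(state, N_COLS)


n_states = N_ROWS * N_COLS
goal_state = to_state(3, 3)

print(n_states, goal_state, to_cell(5), LAKE_MAP[1][1])
```

Because every cell maps to one integer, a state can be used directly as a row index into the Q-table.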

<hr>

## Importing Libraries

In [1]:
import gym
import numpy as np
import matplotlib.pyplot as plt

In [2]:
env = gym.make('FrozenLake-v1')

## Declaring Parameters

In [3]:
# Hyperparameters
alpha = 0.1  # Learning rate
gamma = 0.9  # Discount factor
epsilon = 1.0  # Exploration rate (decays over time)
max_epsilon = 1.0  # Initial exploration rate
min_epsilon = 0.01  # Minimum exploration rate
decay_rate = 0.01  # Decay rate for epsilon (not used by the linear schedule in the training loop)
num_episodes = 1000
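
The training loop in this notebook decays epsilon linearly and never reads `decay_rate`. One common alternative that does use it is an exponential schedule. The sketch below is an assumption about how `decay_rate` might have been intended, not something the notebook itself does, and the function name `exponential_epsilon` is hypothetical.

```python
import math

max_epsilon = 1.0   # initial exploration rate
min_epsilon = 0.01  # floor for the exploration rate
decay_rate = 0.01   # speed of the exponential decay


def exponential_epsilon(episode):
    """Exploration rate that decays exponentially toward min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# Inspect the schedule at a few points of a 1000-episode run
for episode in (0, 100, 500, 999):
    print(episode, round(exponential_epsilon(episode), 4))
```

Compared with the linear schedule, this form drops quickly in the early episodes and then flattens out, so more of the run is spent exploiting.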

## Making the Q-Table

In [4]:
# Q-table initialization
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Reward per episode record
rewards_per_episode = []
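
Each row of the Q-table holds one value per action, so a greedy policy can be read off by taking the best action in every row. The sketch below uses a made-up three-state table in plain Python lists. The action order (0 = Left, 1 = Down, 2 = Right, 3 = Up) is the one Frozen Lake uses, and `greedy_action` is a hypothetical helper.

```python
# FrozenLake action indices: 0 = Left, 1 = Down, 2 = Right, 3 = Up
ACTIONS = ["Left", "Down", "Right", "Up"]

# Hypothetical Q-table with 3 states and 4 actions (values are made up)
Q = [
    [0.1, 0.4, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # untrained row: every action ties at zero
    [0.3, 0.1, 0.9, 0.2],
]


def greedy_action(q_row):
    """Index of the largest value; ties resolve to the lowest index."""
    return max(range(len(q_row)), key=q_row.__getitem__)


policy = [ACTIONS[greedy_action(row)] for row in Q]
print(policy)
```

Note that an all-zero row always resolves to action 0, which is why some exploration is needed before the greedy choice becomes meaningful.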

## Running Episodes to Learn and Update the Q-Table

In [12]:
for episode in range(num_episodes):
  # Reset environment for each episode.
  # With the five-value step() API used below, reset() returns (observation, info),
  # so the integer state must be unpacked before it can index the Q-table.
  state, info = env.reset()
  done = False
  total_reward = 0

  # Linear epsilon decay from max_epsilon toward min_epsilon
  epsilon = max_epsilon - (max_epsilon - min_epsilon) * episode / num_episodes

  while not done:
    # Choose action based on epsilon-greedy strategy
    if np.random.rand() < epsilon:
      action = env.action_space.sample()  # Explore random action
    else:
      action = np.argmax(Q[state, :])  # Exploit best known action

    # Take action, observe next state, reward, and the two end-of-episode flags
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

    # Update Q-table using the Bellman equation for Q-learning.
    # Only a true terminal state (hole or goal) has no future reward;
    # a time-limit truncation still bootstraps from the next state.
    expected_future_reward = np.max(Q[next_state, :]) if not terminated else 0
    Q[state, action] += alpha * (reward + gamma * expected_future_reward - Q[state, action])

    # Update total reward for the episode
    total_reward += reward

    state = next_state

  # Append the total reward for the episode
  rewards_per_episode.append(total_reward)
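
Once the loop runs to completion, `rewards_per_episode` holds one total per episode. In Frozen Lake the only reward is 1 for reaching the goal, so a moving average of this list approximates the success rate over time. The sketch below is one way to compute it in plain Python. The reward list is made up for illustration, and `moving_average` is a hypothetical helper, not part of the notebook.

```python
def moving_average(values, window):
    """Average of each consecutive run of `window` values."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        if i >= window - 1:
            averages.append(running_sum / window)
    return averages


# Made-up episode rewards (1 = reached the goal, 0 = fell in a hole or timed out)
rewards_per_episode = [0, 0, 1, 0, 1, 1, 0, 1]
success_rate = moving_average(rewards_per_episode, window=4)
print(success_rate)
```

The resulting list can be passed to `plt.plot` to visualize learning progress. A larger window, such as 100 episodes, is probably more informative for a 1000-episode run.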