# Contents: Q-Learning

In this notebook, you will implement the Q-Learning reinforcement learning algorithm for the Frozen Lake environment.

Write the code to define and train the agent.
Make sure to include a visualization of the end result in the form of a video.

## Frozen Lake

Frozen Lake is a toy-text environment in which the agent crosses a frozen lake from the start tile to the goal tile without falling into any holes. <br>

The lake can also be made slippery, so that the agent does not always move in the intended direction. Here we only look at the non-slippery case, but you are welcome to try the slippery one.<br>

You can read more about the environment [here](https://gymnasium.farama.org/environments/toy_text/frozen_lake/).

![Frozen Lake](https://gymnasium.farama.org/_images/frozen_lake.gif)
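
The non-slippery dynamics described above are simple enough to sketch in plain Python. The block below is an illustrative stand-in, not the library's implementation: `LAKE`, `MOVES`, and `step` are hypothetical names, and it assumes the default 4x4 map and the action encoding 0 = left, 1 = down, 2 = right, 3 = up.

```python
# Default 4x4 layout: S = start, F = frozen, H = hole, G = goal.
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
N = 4

# Assumed action encoding: 0 = left, 1 = down, 2 = right, 3 = up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def step(state, action):
    """Deterministic (non-slippery) transition for a state index in 0..15."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    # Moving into the border leaves the agent where it is.
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    tile = LAKE[row][col]
    new_state = row * N + col
    reward = 1.0 if tile == "G" else 0.0   # reward only for reaching the goal
    terminated = tile in "HG"              # holes and the goal end the episode
    return new_state, reward, terminated


print(step(0, 1))   # move down from the start tile
print(step(14, 2))  # move right into the goal tile
```

States are numbered row by row, so state 0 is the top-left start tile and state 15 is the bottom-right goal.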


## OpenAI Gym

[OpenAI Gym](https://www.gymlibrary.dev/) is a toolkit for developing and comparing reinforcement learning (RL) algorithms. It consists of a growing suite of environments (from simulated robots to Atari games), and a site for comparing and reproducing results. OpenAI Gym provides a diverse suite of environments that range from easy to difficult and involve many different kinds of data.

Creating and interacting with Gym environments is simple:

```
import gym
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)

for _ in range(1000):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()
env.close()
```

The following are definitions of some common terms.

**Reset:** Resets the environment to an initial state and returns the initial observation. <br>
**Step:** Run one timestep of the environment's dynamics.<br>
**Observation:** The observed state of the environment.<br>
**Action:** An action provided by the agent.<br>
**Reward:** The amount of reward returned as a result of taking the action.<br>
**Terminated:** Whether a terminal state (as defined under the MDP of the task) is reached.<br>
**Truncated:** Whether a truncation condition outside the scope of the MDP is satisfied. Typically a time limit, but it can also indicate the agent physically going out of bounds.<br>
**Info:** This contains auxiliary diagnostic information (helpful for debugging, learning, and logging).<br>
**Action Space:** This attribute gives the format of valid actions. It is of datatype Space provided by Gym. For example, if the action space is of type Discrete and gives the value Discrete(2), this means there are two valid discrete actions: 0 & 1.<br>
**Observation Space:** This attribute gives the format of valid observations. It is of datatype Space provided by Gym. For example, if the observation space is of type Box and the shape of the object is (4,), a valid observation is an array of 4 numbers.<br>

Note: Previously, `terminated` and `truncated` were merged into a single variable, `done`. <br>
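
To make the difference between the two flags concrete, here is a minimal sketch using a toy corridor. `ToyCorridor` and `run` are hypothetical and are not part of Gym; the class only mimics the five-value return shape of `step` described above.

```python
class ToyCorridor:
    """Hypothetical 1-D corridor. NOT a Gym environment; it only mimics step()'s return shape."""

    def __init__(self, length=3, time_limit=5):
        self.length = length
        self.time_limit = time_limit

    def reset(self):
        self.position = 0
        self.elapsed = 0
        return self.position, {}

    def step(self, action):
        # Action 1 moves one tile to the right; any other action stays in place.
        self.position += 1 if action == 1 else 0
        self.elapsed += 1
        # Reaching the end of the corridor is a terminal state of the task itself.
        terminated = self.position == self.length
        # Running out of time is a cutoff imposed from outside the task.
        truncated = not terminated and self.elapsed >= self.time_limit
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}


def run(env, policy):
    """Roll out one episode and report how it ended."""
    observation, info = env.reset()
    while True:
        observation, reward, terminated, truncated, info = env.step(policy(observation))
        done = terminated or truncated  # the older single-flag convention
        if done:
            return reward, terminated, truncated


goal_result = run(ToyCorridor(), lambda obs: 1)  # always move right
idle_result = run(ToyCorridor(), lambda obs: 0)  # never move
print(goal_result, idle_result)
```

The first rollout ends because the task is finished (`terminated`), the second only because the time limit ran out (`truncated`).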


We will use OpenAI Gym for the Frozen Lake environment.

In [None]:
# env.action_space.sample() draws a random valid action (for CartPole: push left or push right).
# Previously, terminated and truncated were merged into a single variable, done.

## Creating the environment

In [7]:
import numpy as np
import gym
import random

In [8]:
# Create the environment.
# is_slippery=False makes every move deterministic (no slipping on the ice).
# new_step_api=True makes env.step() return five values:
# (observation, reward, terminated, truncated, info).
env = gym.make("FrozenLake-v1", is_slippery=False, new_step_api=True)

### Solve here

Write the code to define and train the agent:

In [9]:
#This line retrieves the number of possible states in the "FrozenLake-v1" environment.
state_size = env.observation_space.n
print(state_size)

16


In [10]:
#This line retrieves the number of possible actions that the agent can take in the "FrozenLake-v1" environment.
action_size = env.action_space.n
print(action_size)

4


In [11]:
qtable = np.zeros((state_size, action_size))
print(qtable) # 16x4

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [None]:
total_episodes = 20000  # number of training episodes
learning_rate = 0.1     # step size used when updating Q-values
max_steps = 99          # maximum number of steps per episode
gamma = 0.95            # discount factor: weight of future rewards relative to immediate ones

# Exploration parameters
# Epsilon controls the balance between exploration (trying new actions) and
# exploitation (using the best currently known action).
epsilon = 1.0           # start fully random: every action is exploratory

max_epsilon = 1.0       # exploration probability at the start of training
min_epsilon = 0.01      # floor on the exploration probability
decay_rate = 0.001      # exponential decay rate of epsilon per episode
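
To get a feel for these exploration parameters, the sketch below evaluates the exponential decay formula that the training loop uses to update epsilon. The helper `epsilon_at` is a hypothetical name introduced only for illustration; the constants are the ones set above.

```python
import math

# Values taken from the hyperparameter cell.
max_epsilon, min_epsilon, decay_rate = 1.0, 0.01, 0.001


def epsilon_at(episode):
    """Exploration probability after `episode` episodes of exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 1000, 5000, 19999):
    print(f"episode {episode:>5}: epsilon = {epsilon_at(episode):.4f}")
```

Epsilon starts at `max_epsilon` and approaches `min_epsilon` without ever dropping below it, so the agent keeps a small amount of exploration throughout training.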



In [None]:
# Training loop for the Q-learning agent in the "FrozenLake-v1" environment.

rewards = []

for episode in range(total_episodes):
    state = env.reset()  # place the agent back in the initial state
    done = False         # whether the episode has ended
    total_rewards = 0    # total reward collected in this episode

    for step in range(max_steps):
        # Epsilon-greedy action selection.
        exp_exp_tradeoff = random.uniform(0, 1)
        if exp_exp_tradeoff > epsilon:
            # Exploit: choose the action with the highest Q-value in this state.
            action = np.argmax(qtable[state, :])
        else:
            # Explore: take a random action.
            action = env.action_space.sample()

        new_state, reward, terminated, truncated, info = env.step(action)

        # The episode ends when the agent reaches the goal or falls into a hole
        # (terminated) or when the environment's time limit is hit (truncated).
        done = terminated or truncated

        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
        # The (1 - done) factor drops the bootstrap term once the episode has ended.
        qtable[state, action] = qtable[state, action] + learning_rate * (
            reward + gamma * np.max(qtable[new_state, :]) * (1 - done) - qtable[state, action]
        )
        total_rewards += reward
        state = new_state  # move to the new state

        if done:
            break

    # Decay epsilon once per episode and record the episode's total reward.
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    rewards.append(total_rewards)
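
The update inside the loop can be checked by hand on a single transition. The following is a minimal sketch with the same learning rate and discount factor; `q_update` is a hypothetical helper that isolates the update for one (state, action) entry, and the input values are made up for illustration.

```python
learning_rate = 0.1
gamma = 0.95


def q_update(q_sa, reward, max_q_next, done):
    """One Q-learning update for a single (state, action) entry."""
    # Target: immediate reward plus the discounted best next value,
    # with the next value dropped when the episode has ended.
    target = reward + gamma * max_q_next * (1 - done)
    return q_sa + learning_rate * (target - q_sa)


# Stepping into the goal: reward 1 and the episode ends, so nothing is bootstrapped.
q_goal_step = q_update(q_sa=0.0, reward=1.0, max_q_next=0.0, done=True)

# One tile earlier: no reward yet, but the next state's best value is assumed to be 1.0.
q_before = q_update(q_sa=0.0, reward=0.0, max_q_next=1.0, done=False)

print(q_goal_step, q_before)
```

Each update moves the entry only a fraction `learning_rate` of the way toward its target, which is why many episodes are needed before the value of the goal propagates back to the start tile.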


qtable



array([[0.73509189, 0.77378094, 0.77378094, 0.73509189, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.73509189, 0.        , 0.81450625, 0.77378094, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.77378094, 0.857375  , 0.77378094, 0.81450625, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.81450625, 0.        , 0.77378094, 0.77378094, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.77378094, 0.81450625, 0.        , 0.73509189, 0.        ,
        0.        , 0.        , 
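
A quick sanity check on the printed values: the only reward is 1 on the step that enters the goal, so the best Q-value in a state that is k moves from the goal should converge to gamma^(k-1). The sketch below assumes the default 4x4 map, where the start tile is 6 moves from the goal; `path` is one illustrative shortest route, not something read from the Q-table.

```python
gamma = 0.95

# One shortest route from the start tile (0) to the goal tile (15).
path = [0, 4, 8, 9, 13, 14, 15]

for steps_taken, state in enumerate(path[:-1]):
    moves_left = len(path) - 1 - steps_taken
    # The final move earns reward 1; every earlier move adds one discount factor.
    optimal_value = gamma ** (moves_left - 1)
    print(f"state {state:>2}: {moves_left} moves left, best Q = {optimal_value:.8f}")

# Best Q-value expected at the start tile.
start_value = gamma ** (len(path) - 2)
```

The value for the start tile, 0.95^5, agrees with the largest entries in the first printed row (0.77378094).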

### Visualization

You are provided with some functions which will help you visualize the results as a video.
Feel free to write your own code for visualization if you prefer.

In [None]:
# For visualization
from gym.wrappers.monitoring import video_recorder
from IPython.display import HTML
from IPython import display
import glob
import base64, io, os

os.environ['SDL_VIDEODRIVER']='dummy'

In [None]:
os.makedirs("video", exist_ok=True)

def show_video(env_name):
    # Function to show a video in the notebook. Do not modify.
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = 'video/{}.mp4'.format(env_name)
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display.display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

def show_video_of_model(env_name, max_steps=100):
    vid = video_recorder.VideoRecorder(env, path="video/{}.mp4".format(env_name))
    state = env.reset()
    done = False
    for t in range(max_steps):
        vid.capture_frame()

        # Write your code to choose an action here.
        action = np.argmax(qtable[state,:])
        # action = env.action_space.sample()
        # print(action)


        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        print(f"state: {state}, action: {action}",next_state, reward, done)
        state = next_state
        if done:
            break
    vid.close()
    env.close()

In [None]:
show_video_of_model("FrozenLake-v1")



state: 0, action: 1 4 0.0 False
state: 4, action: 1 8 0.0 False
state: 8, action: 2 9 0.0 False
state: 9, action: 1 13 0.0 False
state: 13, action: 2 14 0.0 False
state: 14, action: 2 15 1.0 True




In [None]:
show_video("FrozenLake-v1")

