# Cliff Walking with Q-learning

In this part of the homework we will explore the Q-learning algorithm with the Cliff Walking environment from the Gym library.

First, install all the required libraries. 

In [None]:
# The code below relies on the reset()/step() API of gym 0.26 or newer
%pip install "gym[toy_text]>=0.26"
%pip install numpy tqdm ipython

Now import the libraries that we are going to use in the code.

In [None]:
import numpy as np
import gym
import random
from tqdm import tqdm
import time
from IPython.display import clear_output

The first step is to create the environment. The render_mode="human" argument lets us visualize it.

In [None]:
env = gym.make("CliffWalking-v0", render_mode="human").env

The environment was created! Now let's explore:
- **Find out the observation space.**
- **Find out the action space.**
- **Explain what these concepts mean in this environment.**

Hint: https://www.gymlibrary.dev/content/basic_usage/
- **Finally, explore the environment documentation and tell us how the reward is defined.**

Hint: https://www.gymlibrary.dev/environments/toy_text/cliff_walking/
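
To make the idea of a discrete observation more concrete, here is a minimal sketch of how a single integer can encode a position on a grid. It assumes the 4-row by 12-column layout and the row-major encoding described in the Cliff Walking documentation; the helper names position_to_state and state_to_position are hypothetical and are not part of Gym.

```python
# Assumption: a 4 x 12 grid with row-major state indices, as described in the
# Cliff Walking documentation. These helpers are illustrative, not Gym API.
N_ROWS, N_COLS = 4, 12

def position_to_state(row, col, n_cols=N_COLS):
    """Encode a (row, col) grid position as a single integer state."""
    return row * n_cols + col

def state_to_position(state, n_cols=N_COLS):
    """Decode an integer state back into its (row, col) grid position."""
    return divmod(state, n_cols)

start = position_to_state(3, 0)    # bottom-left corner
goal = position_to_state(3, 11)    # bottom-right corner
print(start, goal)                 # 36 47
print(state_to_position(goal))     # (3, 11)
```

Decoding a state this way can help when you inspect individual rows of the Q-table later on.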

In [None]:
#------------TO DO---------------#
state_space = None
#--------------------------------#

print("There are ", state_space, " possible states")

#------------TO DO---------------#
action_space = None
#--------------------------------#
print("There are ", action_space, " possible actions")
     

Now, we'll implement a function for visualizing the environment before the training step. IMPORTANT: The 'pretraining' argument is what lets us run the code at this stage of the homework: when it is True, actions are chosen at random instead of being read from the Q-table. As we haven't defined the Q-table yet, we pass 0 as the q_table_cw parameter.

- **Please describe what is happening in the code.**

In [None]:
def visualize(episodes, max_steps, q_table_cw, pretraining=False):

    list_rewards= []
    
    for episode in range(episodes):
        state, info = env.reset()
        done = False
        print("EPISODE", episode + 1)
        time.sleep(1)
        
        current_reward=0

        for step in range(max_steps):
            clear_output(wait=True)
            env.render()
            time.sleep(0.3)
            
            if pretraining:
                # No Q-table yet: pick a random action
                action = env.action_space.sample()
            else:
                action = np.argmax(q_table_cw[state,:])
                
            new_state, reward, terminated, truncated ,info = env.step(action)
            
            current_reward += reward

            if terminated or truncated:
                clear_output(wait=True)
                env.render()
                print("Terminated or truncated")
                clear_output(wait=True)
                break

            state = new_state

        # Record the total reward once the episode has finished
        list_rewards.append(current_reward)
        print("Episode's reward:", current_reward)
        time.sleep(1)
    
    return list_rewards


# Visualize the environment before the training 

trying = visualize(5,15,0, pretraining=True)


env.close()


Now, **complete the function so that it initializes our Q-table with zeros.**

In [None]:
def initialize_q_table(state_space, action_space):
    #------------TO DO---------------#
    Qtable = None
    #--------------------------------#
    return Qtable

In [None]:
q_table_cw = initialize_q_table(state_space, action_space)
print("Q-table shape: ", q_table_cw.shape)
print("Q-table: ", q_table_cw)

Now, we will create the environment again without the render_mode argument, so that training runs much faster. We will also define the parameters needed in the next part.

In [None]:
env = gym.make("CliffWalking-v0").env


# Training parameters
training_episodes = 50000     # Total training episodes
learning_rate = 0.1           # Learning rate

# Environment parameters
max_steps = 20              # Max steps per episode
gamma = 0.95                # Discounting rate

# Exploration parameters
max_epsilon = 1.0           # Exploration probability at start
min_epsilon = 0.05          # Minimum exploration probability
decay_rate = 0.001          # Exponential decay rate for exploration prob
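
To get a feel for the exploration parameters, the sketch below evaluates the exponential decay schedule used in the training function at a few episode indices. The helper name epsilon_at is hypothetical; the formula and the values are the ones used in this notebook.

```python
import math

# Same values as the exploration parameters above
max_epsilon = 1.0
min_epsilon = 0.05
decay_rate = 0.001

def epsilon_at(episode):
    """Exploration probability for a given episode (hypothetical helper)."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

# Print epsilon at the start, early on, and late in training
for episode in (0, 1000, 5000, 50000):
    print(episode, round(epsilon_at(episode), 3))
```

With these values, epsilon falls from 1.0 to roughly 0.4 after 1,000 episodes and is within about 0.01 of min_epsilon after 5,000, so most of the 50,000 training episodes run close to the minimum exploration rate.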


Now we will implement our greedy and epsilon-greedy policies. **Please explain how the next two functions relate to the concepts of "exploration" and "exploitation".**

In [None]:
def greedy_policy(q_table_cw, state):
    action = np.argmax(q_table_cw[state, :])
    return action

In [None]:
def epsilon_greedy_policy(Qtable, state, epsilon):
  
    random_num = random.uniform(0,1)
  
    if random_num > epsilon:
        action = greedy_policy(Qtable, state)
    else:
        action = env.action_space.sample()

    return action
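
As a rough illustration of how epsilon changes the behaviour of such a policy, the sketch below applies the same rule to a single row of made-up Q-values and counts how often the greedy action is chosen. It uses only the standard library; the helper epsilon_greedy and the values in q_row are assumptions for demonstration and are not taken from a trained table.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index from one row of Q-values (illustrative helper)."""
    if rng.random() > epsilon:
        # Exploit: take the action with the highest estimated value
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Explore: take a uniformly random action
    return rng.randrange(len(q_row))

rng = random.Random(0)             # fixed seed so the run is repeatable
q_row = [-3.0, -1.0, -2.0, -4.0]   # made-up values; action 1 is the greedy choice

for epsilon in (0.0, 0.5, 1.0):
    picks = [epsilon_greedy(q_row, epsilon, rng) for _ in range(10_000)]
    # Fraction of steps on which the greedy action was selected
    print(epsilon, picks.count(1) / len(picks))
```

With four actions, the fraction of greedy picks should be close to 1 - epsilon + epsilon/4, because a random pick still lands on the greedy action a quarter of the time.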

Now we are ready to implement our training process!

In [None]:
def train(training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, q_table_cw):
    
    for episode in tqdm(range(training_episodes)):
        
        # Reduce epsilon (because we need less and less exploration)
        epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)

        # Reset the environment
        state, info = env.reset()
        terminated = False
        truncated = False

        # Repeat for at most max_steps steps
        for step in range(max_steps):

            # Choose the action At using the epsilon-greedy policy
            action = epsilon_greedy_policy(q_table_cw, state, epsilon)

            # Take action At and observe Rt+1 and St+1
            new_state, reward, terminated, truncated, info = env.step(action)

            # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
            # (learning_rate and gamma are read from the global parameters)
            q_table_cw[state, action] = q_table_cw[state, action] + learning_rate * (
                reward + gamma * np.max(q_table_cw[new_state, :]) - q_table_cw[state, action]
            )

            # If terminated or truncated, finish the episode
            if terminated or truncated:
                break

            # Our next state is the new state
            state = new_state
    return q_table_cw
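
To see what a single update in the loop above does numerically, here is a small worked sketch of the same rule applied to one (state, action) entry. The function q_update is a hypothetical helper, and the step reward of -1 and the next-state values are assumptions chosen for illustration; the default learning rate and discount match the parameters defined earlier.

```python
def q_update(q_sa, reward, max_q_next, learning_rate=0.1, gamma=0.95):
    """One tabular Q-learning update for a single (state, action) pair."""
    td_target = reward + gamma * max_q_next
    return q_sa + learning_rate * (td_target - q_sa)

# First visit: the table is all zeros and the step reward is assumed to be -1
q = q_update(q_sa=0.0, reward=-1.0, max_q_next=0.0)
print(round(q, 4))   # -0.1

# Same transition again, now assuming the best next-state value is also -0.1
q = q_update(q_sa=q, reward=-1.0, max_q_next=-0.1)
print(round(q, 4))   # -0.1995
```

Each update moves the entry only a fraction (the learning rate) of the way towards the target, which is why many episodes are needed before the values settle.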

In [None]:
q_table_cw = train(training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, q_table_cw)
q_table_cw

Finally, we create the environment once again in order to visualize it. Note that we are now using the trained Q-table, so we no longer need the "pretraining" parameter.

You should be able to visualize the results now!

In [None]:
env = gym.make("CliffWalking-v0", render_mode="human").env  # Create the environment for visualization


rewards = visualize(5,20,q_table_cw)


env.close()
    

We hope you enjoyed this part of the homework. 

Now it is your turn to experiment:
- **Run at least 8 experiments varying the number of training episodes and any other parameters you find interesting. Please add a table with the final rewards obtained in your experiments and an analysis of the results in your document.**

***IMPORTANT: Before uploading the modified .ipynb files, be sure to clear all the outputs.***