# Cliff Walking and Q-Learning
# Cliff Walking Problem:

The Cliff Walking environment is a grid world problem commonly used in reinforcement learning. An agent must navigate from a start state to a goal state; in Gymnasium's `CliffWalking-v0` these are the bottom-left and bottom-right corners of a 4x12 grid. The challenge is the cliff that runs along the bottom row between them. Every step gives a reward of -1, and stepping into the cliff gives -100 and sends the agent back to the start state without ending the episode.
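
The grid can be sketched in plain Python to make the state numbering and rewards concrete. The 4x12 size, the row-major state index, and the start, goal, and cliff positions are taken from Gymnasium's `CliffWalking-v0` documentation; the helper names `to_state` and `step` are hypothetical, and this is an illustration rather than the library's implementation.

```python
# Minimal model of the Cliff Walking grid (illustrative, not Gymnasium's code).
N_ROWS, N_COLS = 4, 12
START = (3, 0)                                  # bottom-left corner
GOAL = (3, 11)                                  # bottom-right corner
CLIFF = {(3, col) for col in range(1, 11)}      # bottom row between start and goal

# Action numbering assumed to match CliffWalking-v0: 0=up, 1=right, 2=down, 3=left.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def to_state(row, col):
    """Flatten a (row, col) position into a single row-major state index."""
    return row * N_COLS + col


def step(position, action):
    """Return (next_position, reward, terminated) for one move."""
    d_row, d_col = MOVES[action]
    row = min(max(position[0] + d_row, 0), N_ROWS - 1)   # stay inside the grid
    col = min(max(position[1] + d_col, 0), N_COLS - 1)
    if (row, col) in CLIFF:
        return START, -100, False    # fall: big penalty, back to start, episode goes on
    return (row, col), -1, (row, col) == GOAL


print(to_state(*START), to_state(*GOAL))   # start and goal state indices
print(step(START, 1))                      # stepping right from the start falls off the cliff
```

The flat index returned by `to_state` is the kind of integer observation that is later used to select a row of the Q-table.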

# Q-Learning:

Q-learning is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for a given finite Markov decision process. It aims to learn a policy, which tells an agent what action to take under what circumstances. The agent learns to estimate the value of actions taken in states (Q-values) based on the rewards it receives for its actions. Over time, the agent learns the best actions to take in each state to maximize its cumulative reward.
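
In standard textbook notation, the estimate is adjusted after every transition from state $s_t$ to $s_{t+1}$. The symbols $\alpha$ (learning rate) and $\gamma$ (discount factor) are the usual names and are assumed here to correspond to `learning_rate` and `discount_factor` in the code of this document:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \Bigl[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Bigr]
```

The bracketed term is the temporal-difference error: how far the current estimate is from the reward just received plus the discounted value of the best next action.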

# Imports:

- `import gymnasium as gym`: imports the Gymnasium library, which provides the Cliff Walking environment. Gymnasium is a toolkit for developing and comparing reinforcement learning algorithms.
- `import numpy as np`: imports the NumPy library, used for numerical operations such as handling arrays (for example the Q-table).
- `import random`: imports the `random` module, used to make random selections, which is crucial for the exploration part of the Q-learning algorithm.

```python
import gymnasium as gym
import numpy as np
import random
```

# Function Definition: 

The function `q_learning_epsilon_constant` takes the parameters `it_max`, `epsilon`, `learning_rate`, and `discount_factor`, plus an optional `render` flag that defaults to `False`. Together they configure the Q-learning run; each one is described under Defining Parameters.

# Environment Initialization:

The Cliff Walking environment is created with `gym.make('CliffWalking-v0', ...)`. The `render_mode` depends on the `render` parameter: it is `"human"` when `render` is `True`, so the environment shows its graphical representation, and `None` otherwise.

# Defining States and Actions:

The number of states (`num_states`) and actions (`num_actions`) is read from `env.observation_space.n` and `env.action_space.n`. For Cliff Walking this gives 48 states and 4 actions. This information is used to structure the Q-table.

# Initializing Q-Table and Rewards Table:

The Q-table (`q_table`) is initialized with zeros and has shape `(num_states, num_actions)`, one row per state and one column per action. The `rewards_table` is an array of length `it_max` that keeps track of the total reward received in each episode.

# Episode Iteration:

This loop iterates over the number of episodes (`it_max`). For each episode, the environment is reset, and variables like `terminated`, `truncated`, and `total_rewards` are initialized. `observation` holds the initial state of the environment.

# Action Selection and Environment Step:

Inside each episode, the algorithm runs until the episode is either `terminated` (the agent reached the goal) or `truncated` (a step limit, if one is configured, was hit). Falling off the cliff does not end the episode: the agent receives the penalty and continues from the start state. At every step the agent takes a random action with probability `epsilon` (exploration) and otherwise the best-known action, the one with the highest Q-value for the current state (exploitation). The environment then moves to the next state (`next_observation`) and returns the reward and the new state information. The reward is added to `total_rewards` for the episode.
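
The exploration/exploitation choice can be isolated in a small helper. The sketch below uses only the standard library; the function name `epsilon_greedy` and the sample Q-values are hypothetical and chosen for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from one row of Q-values.

    With probability epsilon a uniformly random action is returned (exploration);
    otherwise the first action with the highest Q-value is returned (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)                       # seeded only to make the demo repeatable
row = [-1.0, -0.5, -2.0, -1.5]               # hypothetical Q-values for one state
picks = [epsilon_greedy(row, 0.1, rng) for _ in range(1000)]
print(picks.count(1) / len(picks))           # share of picks that land on the greedy action
```

With `epsilon = 0.1` and four actions, the greedy action should be chosen about 92.5% of the time, because a random draw also lands on it one time in four.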

These code sections collectively set up and run the Q-learning algorithm in the Cliff Walking environment, handling the learning process and interactions with the environment.

# Q-Learning Update:

```python
diff = reward + discount_factor * np.max(q_table[next_observation]) - q_table[observation, action]
q_table[observation][action] += learning_rate * diff
```
This block of code is the core of the Q-learning algorithm. For each step in an episode, it updates the Q-value for the current state-action pair. The update rule is based on the Bellman equation, where:

- `reward`: the immediate reward received after performing the action.
- `discount_factor * np.max(q_table[next_observation])`: the maximum predicted Q-value for the next state, weighted by the discount factor.
- `q_table[observation, action]`: the current Q-value for the state-action pair.
- `diff`: the temporal-difference error, that is, the difference between the target (`reward` plus the discounted best next Q-value) and the current Q-value.
- `learning_rate`: the rate at which the Q-table is updated. A higher learning rate means the current estimate moves further toward the new target on each update.
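
A single update can be worked through by hand. The numbers below are made up for illustration; only the formula mirrors the update described above.

```python
# One Q-learning update with made-up numbers (all values are hypothetical).
learning_rate = 0.9
discount_factor = 0.9

q_current = -2.0                          # Q(s, a) before the update
q_next = [-1.0, -3.0, -0.5, -4.0]         # Q(s', .) for the four actions in the next state
reward = -1                               # reward for an ordinary step

td_target = reward + discount_factor * max(q_next)   # -1 + 0.9 * (-0.5)
diff = td_target - q_current                          # temporal-difference error
q_updated = q_current + learning_rate * diff

print(td_target, diff, q_updated)
```

Because the target (about -1.45) is better than the old estimate (-2.0), the error is positive and the Q-value moves up most of the way toward the target.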

# Update State and Episode Handling:

```python
observation = next_observation            # inside the step loop
rewards_table[episode] = total_rewards    # after the episode has ended
```
After the Q-table update, the current observation is updated to the next observation. At the end of each episode, the total rewards for that episode are stored in the `rewards_table`.

# Optimal Episode Check:

This part checks whether the agent has achieved the best possible outcome for the first time. In the Cliff Walking environment, `-13` is the highest possible total reward (least negative), because the shortest path that avoids the cliff takes 13 steps at a reward of -1 each. The first time an episode reaches this total, its episode number is stored as `optimal_episode`.
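
The value `-13` can be checked with a little arithmetic, assuming the 4x12 layout of `CliffWalking-v0` with the start and goal in the bottom corners and a reward of -1 per step:

```python
# Shortest safe route on the 4x12 grid: up one row, across above the cliff, down one row.
steps_up = 1
steps_right = 11          # from column 0 to column 11
steps_down = 1
step_reward = -1          # assumed per-step reward in CliffWalking-v0

optimal_steps = steps_up + steps_right + steps_down
optimal_return = optimal_steps * step_reward
print(optimal_steps, optimal_return)   # 13 -13
```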

# Rendering and Closing the Environment:

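```python
if render:
    env.render()
env.close()
```
If rendering is enabled, the environment is rendered once more after each episode. With `render_mode="human"`, Gymnasium already draws the environment on every step, so this call is mostly redundant. After the last episode the environment is closed, which is important for resource cleanup.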

# Return Statement:

```python
return q_table, rewards_table, optimal_episode
```
The function returns the learned Q-table, the total reward of each episode, and the index of the first episode in which the optimal path was found (`None` if it never was).

# Function Execution and Output:

The parameters for the Q-learning algorithm (`it_max`, `epsilon`, `learning_rate`, `discount_factor`) are defined, and the Q-learning function is executed with them. It returns the learned Q-table, the total reward of each episode, and the first optimal episode; the average reward and the percentage of successful episodes can then be computed from `rewards_table`. The listing under Code stops at the function call and does not print these values.
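
The summary statistics mentioned here could be computed along the following lines. This is a sketch on hypothetical data: the reward list is invented, and treating an episode as "successful" when its total reward equals `-13` is an assumption, since the text does not define success.

```python
# Summary statistics from per-episode returns (the list below is hypothetical data).
rewards_per_episode = [-250, -113, -40, -17, -13, -15, -13, -13]
OPTIMAL_RETURN = -13          # assumed definition of a "successful" episode

average_reward = sum(rewards_per_episode) / len(rewards_per_episode)
successes = sum(1 for r in rewards_per_episode if r == OPTIMAL_RETURN)
success_percent = 100 * successes / len(rewards_per_episode)

# Index of the first optimal episode, or None if there is none.
first_optimal = next(
    (i for i, r in enumerate(rewards_per_episode) if r == OPTIMAL_RETURN), None
)

print(f"average reward: {average_reward:.2f}")
print(f"successful episodes: {success_percent:.1f}%")
print(f"first optimal episode: {first_optimal}")
```

With the NumPy array returned by the function, `rewards_table.mean()` and `np.mean(rewards_table == -13) * 100` would give the same two quantities.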

# Defining Parameters:

```python
it_max = 300
epsilon = 0.1
learning_rate = 0.9
discount_factor = 0.9
```
Here, you are setting up the parameters for the Q-learning algorithm.

- `it_max`: the maximum number of episodes for training the Q-learning agent.
- `epsilon`: the exploration rate. This is a probability value that dictates how often the agent will choose a random action over what it believes is the best action. This aids in exploring the state space.
- `learning_rate`: determines how much the Q-value is updated during learning. A higher value means faster learning, but can be less stable.
- `discount_factor`: used to balance immediate and future rewards. A value closer to 1 places more importance on future rewards.
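
The effect of the discount factor can be seen by discounting the same reward sequence with different values. The helper `discounted_return` is a hypothetical name, and the sequence of thirteen -1 rewards is just an example:

```python
def discounted_return(rewards, discount_factor):
    """Sum of rewards where the k-th future reward is weighted by discount_factor**k."""
    return sum(discount_factor ** k * r for k, r in enumerate(rewards))


rewards = [-1] * 13                       # thirteen ordinary steps at -1 each
for gamma in (0.5, 0.9, 1.0):
    print(gamma, round(discounted_return(rewards, gamma), 3))
```

With a discount factor of 0.5 the later steps barely count and the total stays near -2, while with 1.0 every step counts fully and the total is -13.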

# Executing Q-Learning:

```python
q_table, rewards_table, optimal_episode = q_learning_epsilon_constant(it_max, epsilon, learning_rate, discount_factor, render=True)
```
This line calls the `q_learning_epsilon_constant` function with the parameters you've defined. It also sets `render=True`, which means the algorithm's progress will be visually rendered. This function returns:

- `q_table`: the final Q-table after training, representing the learned values for each action in each state.
- `rewards_table`: an array containing the total reward accumulated in each episode.
- `optimal_episode`: the first episode in which the agent achieved the best possible total reward.
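
One way to inspect a learned Q-table is to print the greedy action for each state as an arrow. The sketch below is pure Python on a made-up 2x3 toy table; the helper names are hypothetical, and the action order (up, right, down, left) is assumed to match `CliffWalking-v0`.

```python
# Turn a Q-table into a printable greedy policy (pure-Python sketch with made-up values).
ARROWS = {0: "^", 1: ">", 2: "v", 3: "<"}   # assumed action order: up, right, down, left


def greedy_policy(q_table):
    """Return, for every state, the index of the first action with the highest Q-value."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q_table]


def render_policy(q_table, n_cols):
    """Format the greedy policy as one string of arrows per grid row."""
    actions = greedy_policy(q_table)
    return [
        "".join(ARROWS[a] for a in actions[i:i + n_cols])
        for i in range(0, len(actions), n_cols)
    ]


# Hypothetical 2x3 grid (6 states, 4 actions) just to show the shapes involved.
toy_q_table = [
    [-3, -1, -2, -4], [-3, -1, -2, -4], [-3, -4, -1, -2],
    [-1, -5, -9, -9], [-9, -9, -9, -9], [0, 0, 0, 0],
]
for line in render_policy(toy_q_table, n_cols=3):
    print(line)
```

For the 48x4 table returned by the function, calling `render_policy(q_table, n_cols=12)` would give four rows of arrows. States that were never updated keep all-zero rows and show the first action, so their arrows carry no information.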

# Code:

```python
import gymnasium as gym
import numpy as np
import random


def q_learning_epsilon_constant(it_max, epsilon, learning_rate, discount_factor, render=False):
    """Tabular Q-learning on CliffWalking-v0 with a constant exploration rate."""
    env = gym.make('CliffWalking-v0', render_mode="human" if render else None)
    num_states = env.observation_space.n
    num_actions = env.action_space.n
    q_table = np.zeros((num_states, num_actions))
    rewards_table = np.zeros(it_max)  # total reward per episode
    optimal_episode = None            # first episode that reaches the optimal return

    for episode in range(it_max):
        terminated = False
        truncated = False
        total_rewards = 0
        observation, info = env.reset()

        while not terminated and not truncated:
            if render:
                env.render()

            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = env.action_space.sample()        # explore
            else:
                action = np.argmax(q_table[observation])  # exploit

            next_observation, reward, terminated, truncated, info = env.step(action)
            total_rewards += reward

            # Q-learning update
            diff = reward + discount_factor * np.max(q_table[next_observation]) - q_table[observation, action]
            q_table[observation][action] += learning_rate * diff

            observation = next_observation

        rewards_table[episode] = total_rewards

        # -13 is the total reward of the shortest safe path
        if total_rewards == -13 and optimal_episode is None:
            optimal_episode = episode

        if render:
            env.render()

    env.close()
    return q_table, rewards_table, optimal_episode


# Define your parameters
it_max = 300
epsilon = 0.1
learning_rate = 0.9
discount_factor = 0.9

# Perform Q-learning with rendering
q_table, rewards_table, optimal_episode = q_learning_epsilon_constant(it_max, epsilon, learning_rate, discount_factor, render=True)
```


# Notice:
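Rendering in `"human"` mode opens a window and slows training down considerably. To get the reward and success percentages quickly, write `"rgb_array"` instead of `"human"` as the `render_mode`, or call the function with `render=False`.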