<a href="https://colab.research.google.com/github/TBKHori/Music-Recon13/blob/main/Predicted_action_taken_according_to_q_values.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [15]:
!apt-get install ffmpeg freeglut3-dev xvfb  # For visualization

!pip install "stable-baselines3[extra]>=2.0.0a4"
!pip install -U tensorboard-plugin-profile



Q-values for the initial state

Exercise (5 minutes): predict the action taken according to the Q-values

Using the get_q_values() function, retrieve the Q-values for the initial observation, print them for each action ("left", "nothing", "right"), and print the action that the greedy (deterministic) policy would take, i.e., the action with the highest Q-value for that state.

In [98]:
import numpy as np
import torch as th
from stable_baselines3 import DQN
import gym

In [99]:
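def preprocess_obs(obs, observation_space):
    # Min-max scale the observation to [0, 1] using the observation space bounds.
    # Note: SB3's built-in preprocessing only casts Box observations to float, so
    # Q-values computed from this rescaled input can differ from those behind model.predict().
    low, high = observation_space.low, observation_space.high
    return (obs - low) / (high - low)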
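
To make the rescaling concrete, here is a small worked example. It is only a sketch: it hard-codes what are assumed to be the MountainCar-v0 observation bounds (position in [-1.2, 0.6], velocity in [-0.07, 0.07]) instead of reading them from the environment, and min_max_scale and the sample observation are illustrative names and values, not part of the notebook.

```python
import numpy as np

# Assumed MountainCar-v0 bounds, ordered as [position, velocity]
low = np.array([-1.2, -0.07], dtype=np.float32)
high = np.array([0.6, 0.07], dtype=np.float32)


def min_max_scale(obs, low, high):
    # Same formula as preprocess_obs: map [low, high] onto [0, 1]
    return (obs - low) / (high - low)


sample_obs = np.array([-0.5, 0.0], dtype=np.float32)  # hypothetical observation
scaled = min_max_scale(sample_obs, low, high)
print(scaled)  # position is roughly 0.389, velocity is 0.5
```

The lower bound maps to 0 and the upper bound to 1, so a car at rest (velocity 0) lands exactly in the middle of the velocity range.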

In [100]:
def get_q_values(model: DQN, obs: np.ndarray) -> np.ndarray:
    """
    Retrieve Q-values for a given observation.

    :param model: a DQN model
    :param obs: a single observation
    :return: the associated q-values for the given observation
    """
    assert model.get_env().observation_space.contains(obs), f"Invalid observation of shape {obs.shape}: {obs}"

    # Preprocess the observation to match the observation space
    preprocessed_obs = preprocess_obs(obs, model.get_env().observation_space)

    # Convert the preprocessed observation to a batched PyTorch tensor on the model's device
    obs_tensor = th.as_tensor(preprocessed_obs, dtype=th.float32, device=model.device).unsqueeze(0)

    # Disable gradient calculation and get q-values using the q-network
    with th.no_grad():
        q_values = model.q_net(obs_tensor).squeeze().cpu().numpy()  # Remove the extra dimension

    assert isinstance(q_values, np.ndarray), "The returned q_values is not a numpy array"
    assert q_values.shape == (model.action_space.n,), f"Wrong shape: ({model.action_space.n},) was expected but got {q_values.shape}"

    return q_values
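
The batch-dimension handling inside get_q_values() can be seen in isolation with a stand-in network. This is a sketch under stated assumptions: toy_q_net is a hypothetical, untrained linear layer with MountainCar's shapes (2 observation features, 3 actions), not the model's actual q_net, and the sample observation is made up.

```python
import torch as th

th.manual_seed(0)
toy_q_net = th.nn.Linear(2, 3)  # hypothetical stand-in: 2 obs features -> 3 actions

single_obs = th.tensor([-0.5, 0.0], dtype=th.float32)  # shape (2,)
batched_obs = single_obs.unsqueeze(0)                  # shape (1, 2): a batch of one

with th.no_grad():  # inference only, no gradient tracking
    batched_q = toy_q_net(batched_obs)                 # shape (1, 3)

q_values = batched_q.squeeze(0).cpu().numpy()          # shape (3,): one value per action
print(batched_obs.shape, batched_q.shape, q_values.shape)
```

Networks operate on batches, so a single observation is wrapped into a batch of one before the forward pass and unwrapped again afterwards.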

In [101]:
# Initialize the environment
env = gym.make("MountainCar-v0", render_mode="rgb_array")


In [102]:
# Create the DQN model
dqn_model = DQN(
    "MlpPolicy",
    env,  # Make sure 'env' is defined and initialized
    verbose=1,
    train_freq=16,
    gradient_steps=8,
    gamma=0.99,
    exploration_fraction=0.2,
    exploration_final_eps=0.07,
    target_update_interval=600,
    learning_starts=1000,
    buffer_size=10000,
    batch_size=128,
    learning_rate=4e-3,
    policy_kwargs=dict(net_arch=[256, 256]),
    seed=2,
)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [103]:
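# Get the initial observation
# (newer Gym/Gymnasium versions return (obs, info); older ones return obs only)
reset_result = env.reset()
obs = reset_result[0] if isinstance(reset_result, tuple) else reset_result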

In [104]:
# Compute the Q-values once, then pick the greedy action (highest Q-value)
q_values = get_q_values(dqn_model, obs)
predicted_action = np.argmax(q_values)
print("Q-Values:", q_values)
print("Predicted Action:", predicted_action)

Q-Values: [ 0.04268264 -0.23019277 -0.10880074]
Predicted Action: 0
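
The exercise also asks for the Q-value of each named action. A minimal sketch of that step, assuming MountainCar-v0 orders its actions as 0 = left, 1 = nothing, 2 = right (the order given in the exercise text) and reusing the Q-values printed above rather than calling the model again:

```python
import numpy as np

# Assumed MountainCar-v0 action ordering, matching the exercise text
action_names = ["left", "nothing", "right"]

# Q-values copied from the output above
q_values = np.array([0.04268264, -0.23019277, -0.10880074])

# Print one line per action
for name, q in zip(action_names, q_values):
    print(f"{name:>8}: {q:+.4f}")

# The greedy policy picks the action with the highest Q-value
greedy_action = int(np.argmax(q_values))
print("Greedy action:", greedy_action, f"({action_names[greedy_action]})")
```

No training call appears in the cells above, so these Q-values come from a freshly initialized network and the greedy choice of "left" reflects the random initialization (fixed by seed=2), not a learned strategy.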
