This notebook contains code blocks useful for running Q-learning on the "Continuous Mountain Car" environment.

In [1]:
from tqdm import tqdm
import numpy as np
import wandb
import gym
from car_model import Car
from mountain_car_agent import MountainCarAgent

In [2]:
from continuous_mountain_car_env_extended import ContinuousMountainCarEnvExtended

# Set render_mode to 'rgb_array' for training/testing
env = ContinuousMountainCarEnvExtended(render_mode='rgb_array')

In [3]:
x_bins = 20  # Number of bins for position
vel_bins = 20  # Number of bins for velocity
action_bins = 5  # Number of discrete actions to sample from
model = Car(env, x_bins, vel_bins, action_bins)
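Since the environment takes a continuous force but tabular Q-learning needs a finite action set, `action_bins` suggests that the force range is sampled at a few evenly spaced values. The sketch below shows one way this could be done. It is not taken from `car_model.Car`, whose internals are not shown here; the force range [-1, 1] is assumed from the standard continuous Mountain Car environment, and `action_index_to_force` is a hypothetical helper.

```python
import numpy as np

action_bins = 5
min_force, max_force = -1.0, 1.0  # Assumed action bounds (standard continuous Mountain Car)

# Evenly spaced candidate forces: one per discrete action index
discrete_actions = np.linspace(min_force, max_force, action_bins)

def action_index_to_force(index):
    """Map a discrete action index to the 1-element array the env expects (hypothetical helper)."""
    return np.array([discrete_actions[index]], dtype=np.float32)

print(discrete_actions)
print(action_index_to_force(4))
```

With five bins the agent can choose between full push left, half push left, no push, half push right and full push right.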

In [4]:
alpha = 0.1  # Learning rate
gamma = 0.99  # Discount factor
agent = MountainCarAgent(model, alpha, gamma)
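For reference, `alpha` and `gamma` are the two parameters of the standard tabular Q-learning update, Q(s, a) ← Q(s, a) + alpha · (r + gamma · max Q(s', ·) − Q(s, a)). The following is a minimal sketch of that rule, assuming a Q-table of shape `(x_bins, vel_bins, action_bins)` and states given as `(x_idx, vel_idx)` tuples. `q_learning_update` is a hypothetical function, and `MountainCarAgent` may implement the update differently.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place (illustrative sketch)."""
    # No bootstrapping from the next state once the episode has ended
    best_next = 0.0 if done else np.max(q_table[next_state])
    td_target = reward + gamma * best_next
    idx = state + (action,)  # (x_idx, vel_idx, action_idx)
    q_table[idx] += alpha * (td_target - q_table[idx])

# Example: one update on an all-zero table
q_table = np.zeros((20, 20, 5))
q_learning_update(q_table, state=(3, 4), action=2, reward=-1.0,
                  next_state=(3, 5), done=False)
print(q_table[3, 4, 2])
```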

In [5]:
# Train the agent
num_training_episodes = 1000
epsilon = 0.2
average_training_rewards = agent.train(num_training_episodes, epsilon)
print(f"Average training reward over {num_training_episodes} episodes: {average_training_rewards}")

Training Progress: 100%|██████████| 1000/1000 [00:04<00:00, 239.04episode/s, Episode Reward=-463]

Average training reward over 1000 episodes: -177.768
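The `epsilon` argument passed to `agent.train` is presumably the exploration rate of an epsilon-greedy policy: with probability `epsilon` a random action is taken, otherwise the action with the highest Q-value. A minimal sketch of that selection rule follows; `epsilon_greedy` is a hypothetical helper, and the actual logic inside `MountainCarAgent` is not shown in this notebook.

```python
import numpy as np

def epsilon_greedy(action_values, epsilon, rng):
    """Pick an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(action_values)))  # Explore
    return int(np.argmax(action_values))  # Exploit

rng = np.random.default_rng(0)
values = np.array([0.1, 0.5, -0.2, 0.0, 0.3])

# With epsilon=0.0 the choice is always the greedy one
greedy_action = epsilon_greedy(values, epsilon=0.0, rng=rng)
print(greedy_action)
```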

In [6]:
# Evaluate the agent
num_evaluation_episodes = 100
average_evaluation_rewards = agent.test(num_evaluation_episodes)
print(f"Average evaluation reward over {num_evaluation_episodes} episodes: {average_evaluation_rewards}")

Average evaluation reward over 100 episodes: -558.95


Get the state from the observation
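A minimal sketch of how a continuous observation `(position, velocity)` could be mapped to a discrete state with 20 bins per dimension. The bounds used here (position in [-1.2, 0.6], velocity in [-0.07, 0.07]) are those of the standard continuous Mountain Car environment and are assumed to hold for the extended environment as well. `make_bin_edges` and `observation_to_state` are hypothetical helpers, not part of `car_model`.

```python
import numpy as np

def make_bin_edges(low, high, n_bins):
    """Interior edges of n_bins equal-width bins over [low, high]."""
    return np.linspace(low, high, n_bins + 1)[1:-1]

def observation_to_state(obs, x_edges, vel_edges):
    """Return (x_idx, vel_idx), each in the range 0 .. n_bins - 1."""
    x, vel = obs
    return int(np.digitize(x, x_edges)), int(np.digitize(vel, vel_edges))

x_edges = make_bin_edges(-1.2, 0.6, 20)      # Assumed position bounds
vel_edges = make_bin_edges(-0.07, 0.07, 20)  # Assumed velocity bounds

state = observation_to_state((-0.5, 0.01), x_edges, vel_edges)
print(state)
```

Using only the interior edges means values at or beyond the bounds fall into the first or last bin instead of producing an out-of-range index.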

In [7]:
wandb.init(project="mountain_car",
           config={
               'x_bins': x_bins,
               'vel_bins': vel_bins,
               'action_bins': action_bins,
               'alpha': alpha,
               'gamma': gamma,
               'epsilon': epsilon,
           })

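# Alternate short training and evaluation blocks, logging both averages to W&B
epsilon_initial = epsilon
for t in range(10):
    train_value = agent.train(100, epsilon_initial)  # Average reward over 100 training episodes
    eval_value = agent.test(30)  # Average reward over 30 evaluation episodes
    wandb.log({'trainValue': train_value, 'evalValue': eval_value, "t": t})
    epsilon_initial *= 0.9  # Decay epsilon after each block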

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mmateogiraz27[0m ([33mmateogiraz27-ort[0m). Use [1m`wandb login --relogin`[0m to force relogin


Training Progress: 100%|██████████| 100/100 [00:00<00:00, 329.30episode/s, Episode Reward=-49]
Training Progress: 100%|██████████| 100/100 [00:00<00:00, 372.89episode/s, Episode Reward=-113]
Training Progress: 100%|██████████| 100/100 [00:00<00:00, 252.13episode/s, Episode Reward=-48]
Training Progress: 100%|██████████| 100/100 [00:00<00:00, 460.10episode/s, Episode Reward=-55]
Training Progress: 100%|██████████| 100/100 [00:00<00:00, 474.28episode/s, Episode Reward=-55]
Training Progress: 100%|██████████| 100/100 [00:00<00:00, 373.43episode/s, Episode Reward=19] 
Training Progress: 100%|██████████| 100/100 [00:00<00:00, 471.71episode/s, Episode Reward=-47]
Training Progress: 100%|██████████| 100/100 [00:00<00:00, 498.61episode/s, Episode Reward=23] 
Training Progress: 100%|██████████| 100/100 [00:00<00:00, 457.02episode/s, Episode Reward=8]  
Training Progress: 100%|██████████| 100/100 [00:00<00:00, 491.00episode/s, Episode Reward=27]
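The loop above multiplies epsilon by 0.9 after every block, so block `t` trains with epsilon = 0.2 · 0.9^t. The short sketch below lists the resulting schedule; `initial_epsilon`, `decay` and `schedule` are illustrative names, not variables from the notebook.

```python
initial_epsilon = 0.2
decay = 0.9

# Exploration rate used in each of the 10 training blocks
schedule = [initial_epsilon * decay**t for t in range(10)]

for t, eps in enumerate(schedule):
    print(f"block {t}: epsilon = {eps:.4f}")
```

The last block therefore still explores with a probability of roughly 0.077.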
