### Import policy, RL agent and Gymnasium

In [1]:
#pip install "gymnasium[box2d]"

In [2]:
import gymnasium as gym
import numpy as np

from deeprl import DQN

In this example, we will use the [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/) environment from Gymnasium, the Farama Foundation's maintained fork of OpenAI Gym. This environment features a discrete action space and a continuous state space. To solve it, we will use the `deeprl` library, which provides an implementation of the Deep Q-Network (DQN) algorithm.

Landing outside the landing pad is possible but with a penalty. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions are available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine.
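To make the action space concrete, the sketch below gives each of the four discrete actions a readable name. `LanderAction` and `describe_action` are hypothetical helpers, not part of Gymnasium or `deeprl`, and the integer indices assume the actions are numbered in the order listed above.

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Hypothetical names for the four discrete Lunar Lander actions.

    Assumption: indices follow the order in which the actions are listed.
    """

    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


def describe_action(action: int) -> str:
    """Turn an action index into a human-readable label."""
    return LanderAction(int(action)).name.replace("_", " ").lower()


print(describe_action(2))  # fire main engine
```

A helper like this is mostly useful for logging or debugging what a trained agent chooses at each step.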

In [3]:
model = DQN(
    "MlpPolicy",
    "LunarLander-v3",
    verbose=1,
    exploration_final_eps=0.1,
    target_update_interval=250,
)

Using cuda device
Creating environment from the given name 'LunarLander-v3'
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


We select the `MlpPolicy` (Multi-Layer Perceptron policy) for this task because the input is a feature vector of length 8 rather than an image.
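The two other arguments passed to the constructor, `exploration_final_eps=0.1` and `target_update_interval=250`, control exploration and target-network updates. The sketch below shows one plausible reading of them and is not `deeprl`'s actual implementation. It assumes epsilon decays linearly from an initial value of 1.0 over the first 10% of training, and that the target network is synchronized every `target_update_interval` steps. `initial_eps` and `exploration_fraction` are assumed defaults that do not appear in this notebook. With 100,000 total timesteps, they happen to reproduce the `exploration_rate` values that appear in the training log (0.97 at 336 steps, 0.942 at 643 steps).

```python
def linear_epsilon(
    step: int,
    total_timesteps: int = 100_000,
    initial_eps: float = 1.0,           # assumed default, not from the notebook
    final_eps: float = 0.1,             # exploration_final_eps in the notebook
    exploration_fraction: float = 0.1,  # assumed default, not from the notebook
) -> float:
    """Linearly anneal epsilon, then hold it at final_eps."""
    decay_steps = exploration_fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


def should_sync_target(step: int, target_update_interval: int = 250) -> bool:
    """Assumed rule: copy online weights to the target network every N steps."""
    return step > 0 and step % target_update_interval == 0


for step in (336, 643, 10_000, 50_000):
    print(step, round(linear_epsilon(step), 3))
```

Under these assumptions the agent acts randomly 10% of the time once the first 10,000 steps have passed. Note that evaluation with `deterministic=True` does not use epsilon at all.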

We import a function to evaluate the agent's performance before and after training.

In [4]:
from deeprl.common.evaluation import evaluate_policy

Let's evaluate the agent's performance before training:

In [5]:
# Separate env for evaluation
eval_env = gym.make("LunarLander-v3")

# Random Agent, before training
mean_reward, std_reward = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=10,
    deterministic=True,
)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")



mean_reward=-554.75 +/- 183.47134486961076


The mean reward is $-554.75 \pm 183.47$, which is very low: the untrained agent is not able to land the lunar lander.
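The two numbers reported above summarize the total reward of each evaluation episode. As a minimal sketch, assuming `evaluate_policy` reports the mean and the population standard deviation of the per-episode returns, the summary could be computed as follows. `summarize_returns` and the sample returns are hypothetical and are not the values from this run.

```python
from statistics import fmean, pstdev


def summarize_returns(episode_returns: list[float]) -> tuple[float, float]:
    """Return (mean, population standard deviation) of per-episode returns."""
    return fmean(episode_returns), pstdev(episode_returns)


# Hypothetical per-episode returns, for illustration only
returns = [-120.0, -80.0, -100.0, -60.0, -140.0]
mean_reward, std_reward = summarize_returns(returns)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
# mean_reward=-100.00 +/- 28.28
```

With only 10 evaluation episodes, the standard deviation is large relative to the mean, so small differences between runs should be read with caution.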

#### Train the agent and save it


In [6]:
# Train the agent
model.learn(total_timesteps=int(1e5))
# Save the agent
model.save("dqn_lunar")
del model  # delete trained model to demonstrate loading

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 84       |
|    ep_rew_mean      | -210     |
|    exploration_rate | 0.97     |
| time/               |          |
|    episodes         | 4        |
|    fps              | 1357     |
|    time_elapsed     | 0        |
|    total_timesteps  | 336      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.48     |
|    n_updates        | 58       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 80.4     |
|    ep_rew_mean      | -167     |
|    exploration_rate | 0.942    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 1541     |
|    time_elapsed     | 0        |
|    total_timesteps  | 643      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 3.91     |
|    n_updates      

After training the model, we save it and then delete it to demonstrate how to load a saved model and evaluate it.

In [9]:
model = DQN.load("dqn_lunar")

In [10]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

mean_reward=-182.64 +/- 56.59685950385565


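The mean reward improved from $-554.75$ to $-182.64$, and the standard deviation dropped from about 183 to about 57, so the agent does a better and more consistent job of controlling the lunar lander. The reward is still negative, however: the Gymnasium documentation treats an episode as solved at 200 points or more, so 100,000 timesteps with these settings are not enough to land reliably.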

In [None]:
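# Watch the trained agent in a rendering window
env = gym.make("LunarLander-v3", render_mode="human")

obs, info = env.reset()
try:
    # Runs until interrupted (e.g. by interrupting the kernel)
    while True:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        # Start a new episode when the lander crashes, lands, or times out
        if terminated or truncated:
            obs, info = env.reset()
except KeyboardInterrupt:
    pass
finally:
    env.close()  # release the rendering window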