# Lunar Lander

In [1]:
!pip install stable-baselines3[extra]
!pip install pyglet





### Install SWIG

`box2d-py` is built with SWIG, so install SWIG before running `pip install gym[box2d]`.

SWIG downloads: https://sourceforge.net/projects/swig/

Ubuntu/Linux: https://www.howtoinstall.me/ubuntu/18-04/swig/

macOS: ``brew install swig``

Additional documentation: https://github.com/pybox2d/pybox2d/blob/master/INSTALL.md

In [2]:
!pip install gym[box2d]



In [4]:
!pip install pygame



In [5]:
!pip install box2d-py



### 1. Import Libraries

In [8]:
import os
import gym
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3 import DQN

### 2. Load Environment

In [9]:
env = gym.make("LunarLander-v2")

### 3. Test Environment

In [16]:
episodes = 5

for episode in range(episodes):
    state = env.reset()  # start a new episode
    reward_cnt = 0       # cumulative reward for this episode
    done = False

    while not done:
        env.render()
        action = env.action_space.sample()  # random action, no learning yet
        new_state, reward, done, info = env.step(action)

        state = new_state
        reward_cnt += reward

    print(f"Episode: {episode} Score: {reward_cnt}")

env.close()

Episode: 0 Score: -156.48096243698
Episode: 1 Score: -55.87547921207003
Episode: 2 Score: -68.0280130859886
Episode: 3 Score: -121.66057392152138
Episode: 4 Score: -134.8971213924894
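
The loop above relies on the older Gym API, where `env.reset()` returns only the observation and `env.step()` returns a 4-tuple. From Gym 0.26 onward (and in Gymnasium), `reset()` returns `(obs, info)` and `step()` returns `(obs, reward, terminated, truncated, info)`. As a minimal sketch, the hypothetical helper `unpack_step` below (not part of Gym) normalizes either result shape so the same loop body could work with both:

```python
def unpack_step(result):
    """Normalize an env.step() result to (obs, reward, done, info).

    Hypothetical helper, not part of Gym. Accepts the 4-tuple of the
    older API or the 5-tuple of Gym >= 0.26 / Gymnasium.
    """
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        # The newer API splits "done" into two flags.
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result
    return obs, reward, done, info


# Old-style result: observation, reward, done, info.
old_style = ([0.0] * 8, -100, True, {})
# New-style result with separate terminated/truncated flags.
new_style = ([0.0] * 8, -0.5, False, True, {})

print(unpack_step(old_style)[2])
print(unpack_step(new_style)[2])
```

Both calls report the episode as finished: the first because `done` is set, the second because the episode was truncated.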


#### 3.1 Action Space

In [11]:
env.action_space

Discrete(4)

In [13]:
env.step(1)

(array([ 0.88401794, -0.14644936,  0.56038874, -0.13590546, -1.4344269 ,
         2.2873518 ,  1.        ,  0.        ], dtype=float32),
 -100,
 True,
 {})
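
`Discrete(4)` means the agent picks one of four integer actions per step. According to the Gym documentation for LunarLander-v2, these correspond to the engine commands below. The names `ACTION_NAMES` and `describe_action` are illustrative only; the environment itself does not expose them. A reward of -100 together with `done=True`, as in the `env.step(1)` output above, is what the environment reports for a crash.

```python
# Action meanings for LunarLander-v2 as described in the Gym documentation.
# ACTION_NAMES is an illustrative name, not part of the Gym API.
ACTION_NAMES = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}


def describe_action(action):
    """Return a readable label for an integer action from Discrete(4)."""
    return ACTION_NAMES[int(action)]


print(describe_action(1))
```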

#### 3.2 Observation Space

In [14]:
env.observation_space

Box([-inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf], (8,), float32)

In [15]:
env.observation_space.sample()

array([-0.5916364 , -0.96207124,  0.4644716 ,  0.07968568,  0.37826055,
        0.3829114 , -0.9656692 ,  0.7205434 ], dtype=float32)
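
The observation is an 8-element vector. Per the Gym documentation for LunarLander-v2, its components are the lander's position, velocity, angle, angular velocity, and two leg-contact flags. The sketch below labels the observation returned by `env.step(1)` earlier; `OBS_NAMES` and `label_observation` are illustrative names, not part of Gym. Note that `observation_space.sample()` only draws random values from the box, so its output need not correspond to a reachable lander state.

```python
OBS_NAMES = [
    "x_position", "y_position",
    "x_velocity", "y_velocity",
    "angle", "angular_velocity",
    "left_leg_contact", "right_leg_contact",
]


def label_observation(obs):
    """Pair each of the 8 observation values with a readable name."""
    if len(obs) != len(OBS_NAMES):
        raise ValueError(f"expected {len(OBS_NAMES)} values, got {len(obs)}")
    return dict(zip(OBS_NAMES, (float(v) for v in obs)))


# Values copied from the env.step(1) output shown earlier.
obs = [0.88401794, -0.14644936, 0.56038874, -0.13590546,
       -1.4344269, 2.2873518, 1.0, 0.0]

for name, value in label_observation(obs).items():
    print(f"{name:>18}: {value: .4f}")
```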

### 4. Train RL Agent

In [17]:
log_path = os.path.join("train", "logs")
log_path

'train/logs'

In [19]:
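# Build the environment inside the factory so the lambda does not capture
# the `env` name that is being reassigned on the same line.
env = DummyVecEnv([lambda: gym.make("LunarLander-v2")])
model = DQN(policy="MlpPolicy", env=env, tensorboard_log=log_path, verbose=1)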

Using cuda device


In [21]:
model.learn(total_timesteps=100_000)

2022-07-09 13:02:07.453070: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1


Logging to train/logs/DQN_1
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.969    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 221      |
|    time_elapsed     | 1        |
|    total_timesteps  | 327      |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.929    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 469      |
|    time_elapsed     | 1        |
|    total_timesteps  | 747      |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.896    |
| time/               |          |
|    episodes         | 12       |
|    fps              | 643      |
|    time_elapsed     | 1        |
|    total_timesteps  | 1096     |
----------------------------------

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.0746   |
| time/               |          |
|    episodes         | 108      |
|    fps              | 2342     |
|    time_elapsed     | 4        |
|    total_timesteps  | 9741     |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 112      |
|    fps              | 2364     |
|    time_elapsed     | 4        |
|    total_timesteps  | 10060    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 116      |
|    fps              | 2389     |
|    time_elapsed     | 4        |
|    total_timesteps  | 10377    |
----------------------------------

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 216      |
|    fps              | 2765     |
|    time_elapsed     | 7        |
|    total_timesteps  | 19362    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 220      |
|    fps              | 2778     |
|    time_elapsed     | 7        |
|    total_timesteps  | 19766    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 224      |
|    fps              | 2790     |
|    time_elapsed     | 7        |
|    total_timesteps  | 20079    |
----------------------------------

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 324      |
|    fps              | 3015     |
|    time_elapsed     | 9        |
|    total_timesteps  | 28838    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 328      |
|    fps              | 3024     |
|    time_elapsed     | 9        |
|    total_timesteps  | 29250    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 332      |
|    fps              | 3030     |
|    time_elapsed     | 9        |
|    total_timesteps  | 29660    |
----------------------------------

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 432      |
|    fps              | 3143     |
|    time_elapsed     | 12       |
|    total_timesteps  | 38746    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 436      |
|    fps              | 3144     |
|    time_elapsed     | 12       |
|    total_timesteps  | 39146    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 440      |
|    fps              | 3148     |
|    time_elapsed     | 12       |
|    total_timesteps  | 39533    |
----------------------------------

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 540      |
|    fps              | 3203     |
|    time_elapsed     | 15       |
|    total_timesteps  | 48484    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 544      |
|    fps              | 3206     |
|    time_elapsed     | 15       |
|    total_timesteps  | 48787    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 548      |
|    fps              | 3208     |
|    time_elapsed     | 15       |
|    total_timesteps  | 49230    |
----------------------------------

----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 620      |
|    fps              | 660      |
|    time_elapsed     | 136      |
|    total_timesteps  | 90105    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 6.71     |
|    n_updates        | 10026    |
----------------------------------
----------------------------------
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 624      |
|    fps              | 640      |
|    time_elapsed     | 146      |
|    total_timesteps  | 94105    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.07     |
|    n_updates        | 11026    |
----------------------------------

<stable_baselines3.dqn.dqn.DQN at 0x7f381aaa3280>
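
The `exploration_rate` values in the log follow DQN's linear epsilon schedule. Assuming the Stable-Baselines3 defaults (`exploration_initial_eps=1.0`, `exploration_final_eps=0.05`, `exploration_fraction=0.1`), epsilon decays linearly over the first 10% of `total_timesteps` and then stays at 0.05. The function below is a standalone re-implementation for illustration, not the library's own code:

```python
def exploration_rate(step, total_timesteps=100_000,
                     initial_eps=1.0, final_eps=0.05, fraction=0.1):
    """Linear epsilon decay over the first `fraction` of training."""
    decay_steps = fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


# Timesteps taken from the training log above.
for step in (327, 747, 1096, 9741, 10060):
    print(step, round(exploration_rate(step), 4))
```

With `total_timesteps=100_000` this gives roughly 0.969 at step 327 and 0.05 from step 10,000 onward, in line with the log. The agent therefore acts almost greedily for about 90% of training.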

### 5. Testing Trained Agent

In [32]:
episodes = 5

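for episode in range(episodes):
    # env is now a DummyVecEnv, so reset() and step() return batched arrays
    # with a single entry (one environment).
    state = env.reset()
    reward_cnt = 0
    done = False

    while not done:
        env.render()
        action, _ = model.predict(state)  # action chosen by the trained policy
        new_state, reward, done, info = env.step(action)

        state = new_state
        reward_cnt += reward  # becomes a one-element array

    print(f"Episode: {episode} Score: {reward_cnt}")

env.close()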

Episode: 0 Score: [-56.39176]
Episode: 1 Score: [-125.84363]
Episode: 2 Score: [-114.66706]
Episode: 3 Score: [-109.27242]
Episode: 4 Score: [-109.41255]
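
Individual episode scores are noisy, so it helps to compare averages. The snippet below uses only the scores printed in this notebook (the random-action run from section 3 and the trained-agent run above) and summarizes them with the standard library. The trained-agent scores appear in brackets because the environment is wrapped in `DummyVecEnv`, which returns rewards as one-element arrays.

```python
from statistics import mean, stdev

# Scores copied from the outputs in this notebook (rounded to 2 decimals).
random_scores = [-156.48, -55.88, -68.03, -121.66, -134.90]
trained_scores = [-56.39, -125.84, -114.67, -109.27, -109.41]

for label, scores in (("random", random_scores), ("trained", trained_scores)):
    print(f"{label:>8}: mean={mean(scores):.2f} stdev={stdev(scores):.2f}")
```

Both averages land a little below -100. With only five episodes each this is a rough comparison, but it suggests that 100,000 timesteps with default hyperparameters was not enough here for DQN to clearly outperform random actions.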


In [25]:
env.close()

### 6. Tune Performance

In [26]:
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

In [27]:
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=150, verbose=1)

In [28]:
save_path = os.path.join("train", "saved_models")

In [29]:
eval_callback = EvalCallback(env,
                             callback_on_new_best=stop_callback,
                             eval_freq=10_000,
                             best_model_save_path=save_path,
                             verbose=1)
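
`EvalCallback` evaluates the agent periodically, and whenever the mean reward sets a new best it triggers `callback_on_new_best`. `StopTrainingOnRewardThreshold` then ends training once that best mean reward reaches the threshold. The sketch below mimics this logic in plain Python for illustration; `first_stop_index` is a hypothetical helper, not Stable-Baselines3 code, and the reward list is a chronological subset of the evaluation results logged in this notebook.

```python
def first_stop_index(mean_rewards, reward_threshold):
    """Return the index of the evaluation that would stop training, or None.

    Training stops at the first evaluation that both sets a new best
    mean reward and reaches `reward_threshold`.
    """
    best = float("-inf")
    for i, reward in enumerate(mean_rewards):
        if reward > best:
            best = reward  # new best: the threshold check runs now
            if best >= reward_threshold:
                return i
    return None


# A subset of the mean evaluation rewards from the training log.
logged = [-560.25, -471.06, -653.43, -327.48, -148.73, -96.25, -82.70]

print(first_stop_index(logged, reward_threshold=150))
print(first_stop_index(logged, reward_threshold=-100))
```

None of the evaluations visible in the log reaches 150, which is consistent with training continuing past 196,000 timesteps instead of stopping early.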

In [30]:
model = DQN(policy="MlpPolicy", env=env, tensorboard_log=log_path, verbose=1)

Using cuda device


In [31]:
model.learn(total_timesteps=200_000, log_interval=10_000, callback=eval_callback)

Logging to train/logs/DQN_2




Eval num_timesteps=1000, episode_reward=-560.25 +/- 185.95
Episode length: 127.00 +/- 14.45
----------------------------------
| eval/               |          |
|    mean_ep_length   | 127      |
|    mean_reward      | -560     |
| rollout/            |          |
|    exploration_rate | 0.953    |
| time/               |          |
|    total_timesteps  | 1000     |
----------------------------------
New best mean reward!
Eval num_timesteps=2000, episode_reward=-471.06 +/- 109.45
Episode length: 116.20 +/- 39.72
----------------------------------
| eval/               |          |
|    mean_ep_length   | 116      |
|    mean_reward      | -471     |
| rollout/            |          |
|    exploration_rate | 0.905    |
| time/               |          |
|    total_timesteps  | 2000     |
----------------------------------
New best mean reward!
Eval num_timesteps=3000, episode_reward=-653.43 +/- 218.81
Episode length: 188.80 +/- 83.90

Eval num_timesteps=21000, episode_reward=-567.50 +/- 263.93
Episode length: 126.40 +/- 45.51
----------------------------------
| eval/               |          |
|    mean_ep_length   | 126      |
|    mean_reward      | -567     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 21000    |
----------------------------------
Eval num_timesteps=22000, episode_reward=-327.48 +/- 141.28
Episode length: 112.60 +/- 52.58
----------------------------------
| eval/               |          |
|    mean_ep_length   | 113      |
|    mean_reward      | -327     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 22000    |
----------------------------------
New best mean reward!
Eval num_timesteps=23000, episode_reward=-410.01 +/- 125.48
Episode length: 105.20 +/- 26.66

Eval num_timesteps=42000, episode_reward=-576.75 +/- 142.63
Episode length: 143.60 +/- 64.79
----------------------------------
| eval/               |          |
|    mean_ep_length   | 144      |
|    mean_reward      | -577     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 42000    |
----------------------------------
Eval num_timesteps=43000, episode_reward=-674.64 +/- 148.29
Episode length: 120.80 +/- 17.63
----------------------------------
| eval/               |          |
|    mean_ep_length   | 121      |
|    mean_reward      | -675     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 43000    |
----------------------------------
Eval num_timesteps=44000, episode_reward=-703.29 +/- 194.83
Episode length: 143.40 +/- 49.34

New best mean reward!
Eval num_timesteps=60000, episode_reward=-148.73 +/- 53.24
Episode length: 424.60 +/- 189.99
----------------------------------
| eval/               |          |
|    mean_ep_length   | 425      |
|    mean_reward      | -149     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 60000    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.554    |
|    n_updates        | 2499     |
----------------------------------
Eval num_timesteps=61000, episode_reward=-207.59 +/- 73.48
Episode length: 749.80 +/- 306.76
----------------------------------
| eval/               |          |
|    mean_ep_length   | 750      |
|    mean_reward      | -208     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 61000    |
| train/              |          |
|    learning_rate    

Eval num_timesteps=75000, episode_reward=-112.06 +/- 22.20
Episode length: 827.80 +/- 344.40
----------------------------------
| eval/               |          |
|    mean_ep_length   | 828      |
|    mean_reward      | -112     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 75000    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.542    |
|    n_updates        | 6249     |
----------------------------------
Eval num_timesteps=76000, episode_reward=-96.25 +/- 13.02
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -96.3    |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 76000    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss  

Eval num_timesteps=90000, episode_reward=-282.91 +/- 127.53
Episode length: 544.60 +/- 376.58
----------------------------------
| eval/               |          |
|    mean_ep_length   | 545      |
|    mean_reward      | -283     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 90000    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.557    |
|    n_updates        | 9999     |
----------------------------------
Eval num_timesteps=91000, episode_reward=-126.97 +/- 108.00
Episode length: 502.40 +/- 407.65
----------------------------------
| eval/               |          |
|    mean_ep_length   | 502      |
|    mean_reward      | -127     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 91000    |
| train/              |          |
|    learning_rate    | 0.0001   |
|    lo

Eval num_timesteps=105000, episode_reward=-195.52 +/- 161.40
Episode length: 864.00 +/- 272.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 864      |
|    mean_reward      | -196     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 105000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.599    |
|    n_updates        | 13749    |
----------------------------------
Eval num_timesteps=106000, episode_reward=-116.65 +/- 23.65
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -117     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 106000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    lo

Eval num_timesteps=120000, episode_reward=-123.92 +/- 14.71
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -124     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 120000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.501    |
|    n_updates        | 17499    |
----------------------------------
Eval num_timesteps=121000, episode_reward=-112.67 +/- 21.50
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -113     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 121000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss

Eval num_timesteps=135000, episode_reward=-103.61 +/- 12.78
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -104     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 135000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.617    |
|    n_updates        | 21249    |
----------------------------------
Eval num_timesteps=136000, episode_reward=-120.45 +/- 12.88
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -120     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 136000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss

Eval num_timesteps=150000, episode_reward=-93.07 +/- 19.24
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -93.1    |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 150000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.57     |
|    n_updates        | 24999    |
----------------------------------
Eval num_timesteps=151000, episode_reward=-92.31 +/- 18.26
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -92.3    |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 151000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss  

Eval num_timesteps=165000, episode_reward=-116.51 +/- 30.63
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -117     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 165000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.708    |
|    n_updates        | 28749    |
----------------------------------
Eval num_timesteps=166000, episode_reward=-82.70 +/- 26.31
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -82.7    |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 166000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss 

Eval num_timesteps=180000, episode_reward=-99.44 +/- 47.27
Episode length: 994.00 +/- 12.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 994      |
|    mean_reward      | -99.4    |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 180000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 10.4     |
|    n_updates        | 32499    |
----------------------------------
Eval num_timesteps=181000, episode_reward=-118.49 +/- 17.63
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -118     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 181000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss 

Eval num_timesteps=195000, episode_reward=-102.64 +/- 21.81
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -103     |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 195000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.449    |
|    n_updates        | 36249    |
----------------------------------
Eval num_timesteps=196000, episode_reward=-93.57 +/- 27.63
Episode length: 1000.00 +/- 0.00
----------------------------------
| eval/               |          |
|    mean_ep_length   | 1e+03    |
|    mean_reward      | -93.6    |
| rollout/            |          |
|    exploration_rate | 0.05     |
| time/               |          |
|    total_timesteps  | 196000   |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss 

<stable_baselines3.dqn.dqn.DQN at 0x7f37ceb056d0>

### 7. Save Model

In [34]:
MODEL_PATH = os.path.join("train", "saved_model", "DQN_lunar_lander_model")

In [35]:
model.save(MODEL_PATH)
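
Stable-Baselines3 stores a model as a zip archive and, when the given path has no extension, appends `.zip` to it. The check below is a small sketch for confirming where the file should end up; `expected_archive` is an illustrative helper, not a library function, and the extension rule it encodes is stated as an assumption.

```python
import os


def expected_archive(path):
    """Return the file name a model saved at `path` should have on disk."""
    _, ext = os.path.splitext(path)
    # Assumption: ".zip" is appended only when no extension is present.
    return path if ext else path + ".zip"


MODEL_PATH = os.path.join("train", "saved_model", "DQN_lunar_lander_model")
archive = expected_archive(MODEL_PATH)
print(archive, "exists:", os.path.exists(archive))
```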



### 8. Delete Model

In [36]:
del model

### 9. Reload Model

In [37]:
model = DQN.load(MODEL_PATH)

### 10. Evaluate Model

In [38]:
evaluate_policy(model, env, n_eval_episodes=10, render=True)



(-108.02370287410595, 31.90873397019299)
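
`evaluate_policy` returns a `(mean_reward, std_reward)` tuple. The Gym documentation treats LunarLander as solved at an average score of 200, so the sketch below unpacks the result shown above and compares it with that bar; `SOLVED_SCORE` is an illustrative constant name.

```python
# Result copied from the evaluate_policy output above.
mean_reward, std_reward = (-108.02370287410595, 31.90873397019299)

SOLVED_SCORE = 200  # score at which LunarLander is considered solved

print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
print("solved:", mean_reward >= SOLVED_SCORE)
print(f"gap to solved: {SOLVED_SCORE - mean_reward:.1f} points")
```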

In [39]:
env.close()

### 11. View Performance On Tensorboard

In [40]:
!conda install -c conda-forge tensorboard -y

Collecting package metadata (current_repodata.json): done
Solving environment: | 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:
/ 
  - defaults/linux-64::tensorflow-gpu==2.4.1=h30adc30_0
  - defaults/linux-64::tensorflow==2.4.1=gpu_py39h8236f22_0
  - defaults/linux-64::tensorflow-base==2.4.1=gpu_py39h29c2da4_0
  - defaults/linux-64::hdf5==1.10.6=hb1b8bf9_0
  - defaults/linux-64::scipy==1.7.3=py39hc147768_0
  - defaults/linux-64::libgfortran-ng==7.5.0=ha8ba4b0_17
  - anaconda/noarch::seaborn==0.11.2=pyhd3eb1b0_0
  - defaults/noarch::keras-preprocessing==1.1.2=pyhd3eb1b0_0
  - defaults/linux-64::h5py==2.10.0=py39hec9cf62_0
done


  current version: 4.8.2
  latest version: 4.13.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/prince/anaconda3/envs/tf-gpu

  added / updated specs:
    - tensorboard



In [45]:
training_log_path = os.path.join(log_path, "DQN_2")
training_log_path

'train/logs/DQN_2'

In [46]:
import tensorboard

In [47]:
!tensorboard --logdir={training_log_path}

2022-07-09 13:49:41.692690: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.4.0 at http://localhost:6006/ (Press CTRL+C to quit)
^C
