Imports

In [1]:
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
import optuna

Creating the environment

In [2]:
environment_name = 'CarRacing-v0'
env = gym.make(environment_name)
print('Action space :', env.action_space, '||| Observation space (shape) :', env.observation_space.shape)

Action space : Box([-1.  0.  0.], [1. 1. 1.], (3,), float32) ||| Observation space (shape) : (96, 96, 3)


First test with a random agent

In [3]:
episodes = 2
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+=reward
    print(f'Episode:{episode} Score:{score}')
env.close()

Track generation: 997..1252 -> 255-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1263..1583 -> 320-tiles track


Episode:1 Score:-37.30407523511033
Track generation: 1157..1451 -> 294-tiles track
Episode:2 Score:-35.15358361774796
Track generation: 1216..1524 -> 308-tiles track
Episode:3 Score:-38.11074918566829
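
The loop above relies on the older Gym interface, where `reset()` returns only the observation and `step()` returns four values. From Gym 0.26 onward (and in Gymnasium), `reset()` returns `(obs, info)` and `step()` returns five values, with `done` split into `terminated` and `truncated`. The sketch below shows the same random-agent loop written against that newer interface. `StubEnv`, `sample_action` and `run_episode` are hypothetical names: the stub only mimics the call signatures so the example runs without Gym installed.

```python
import random


class StubEnv:
    """Hypothetical stand-in that mimics the Gym >= 0.26 reset/step signatures."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self, seed=None):
        self.steps = 0
        return 0.0, {}  # (observation, info)

    def sample_action(self):
        # Plays the role of env.action_space.sample() in the real API
        return random.random()

    def step(self, action):
        self.steps += 1
        reward = -0.1                             # constant penalty, for illustration only
        terminated = False                        # the task itself never ends in this stub
        truncated = self.steps >= self.max_steps  # time limit reached
        return 0.0, reward, terminated, truncated, {}


def run_episode(env):
    """Run one episode with random actions and return the total reward."""
    obs, info = env.reset()
    done = False
    score = 0.0
    while not done:
        action = env.sample_action()
        obs, reward, terminated, truncated, info = env.step(action)
        score += reward
        done = terminated or truncated
    return score


score = run_episode(StubEnv(max_steps=5))
print(f"Score: {score:.1f}")
```

With a real environment, only the stub would change: the loop body stays the same apart from calling `env.action_space.sample()`.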


### Creating our first test model

Creating the DummyVecEnv environment used for training

In [4]:
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])
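
`DummyVecEnv` takes a list of callables that each return an environment. The `lambda: env` above looks up `env` only when it is called. As far as we can tell, `DummyVecEnv` calls each factory inside its constructor, before the name `env` is rebound to the wrapper, which is why the cell works. The sketch below illustrates this late-binding behaviour with plain Python values and shows a factory function that does not depend on call order. `make_env_factory` and the string values are hypothetical; a string stands in for the result of `gym.make`.

```python
def make_env_factory(env_name):
    """Return a callable that builds a fresh 'environment' each time it is called.

    A string stands in for gym.make(env_name) so the sketch runs without Gym.
    """
    def _init():
        return f"env<{env_name}>"
    return _init


# Late binding: the lambda reads `env` when it is called, not when it is defined.
env = "raw env"
factory = lambda: env
env = "wrapped env"
late_bound = factory()

# A factory that captures the name explicitly is unaffected by later rebinding.
safe_factory = make_env_factory("CarRacing-v0")
built = safe_factory()

print(late_bound)
print(built)
```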

Declaring the PPO model with Stable Baselines3

In [5]:
model = PPO("CnnPolicy", env, verbose=1, device='mps')

Using mps device
Wrapping the env in a VecTransposeImage.


In [6]:
timesteps = 100000
model.learn(total_timesteps=timesteps)

Track generation: 1081..1355 -> 274-tiles track
Track generation: 1188..1493 -> 305-tiles track
Track generation: 1308..1639 -> 331-tiles track
-----------------------------
| time/              |      |
|    fps             | 134  |
|    iterations      | 1    |
|    time_elapsed    | 15   |
|    total_timesteps | 2048 |
-----------------------------
Track generation: 1259..1578 -> 319-tiles track
Track generation: 1351..1693 -> 342-tiles track
-----------------------------------------
| time/                   |             |
|    fps                  | 109         |
|    iterations           | 2           |
|    time_elapsed         | 37          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.005691627 |
|    clip_fraction        | 0.0385      |
|    clip_range           | 0.2         |
|    entropy_loss         | -4.25       |
|    explained_variance   | 0.00618     |
|    learning_rate        | 0.0003      |
|   

<stable_baselines3.ppo.ppo.PPO at 0x2a32636d0>

We save the model under a name that records its timestep count and its settings

In [7]:
#model_path = f'models/PPO_100k_base__model'
#model.save(model_path)
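
One way to keep saved models distinguishable is to build the file name from the timestep count and the hyperparameters. The helper below is a minimal sketch: `model_path`, its arguments and the naming scheme are assumptions made for illustration, not the scheme used elsewhere in this notebook.

```python
from pathlib import Path


def model_path(timesteps, params=None, root="models"):
    """Build a path such as models/PPO_100k_base (hypothetical naming scheme)."""
    # Express the timestep count compactly: 100000 -> "100k", 1000000 -> "1M"
    if timesteps % 1_000_000 == 0:
        steps = f"{timesteps // 1_000_000}M"
    elif timesteps % 1_000 == 0:
        steps = f"{timesteps // 1_000}k"
    else:
        steps = str(timesteps)

    if not params:
        suffix = "base"
    else:
        # Sort keys so the same settings always give the same name
        suffix = "_".join(f"{k}={v:.3g}" for k, v in sorted(params.items()))

    return Path(root) / f"PPO_{steps}_{suffix}"


print(model_path(100_000))
print(model_path(200_000, {"learning_rate": 3.9e-4, "gamma": 0.918}))
```

Rounding the hyperparameters to three significant digits keeps names short, at the cost of possible collisions between trials with very close values.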

We load the model (for example, to pick the best one)

In [8]:
load_model_path = 'models/PPO_100k_base__model'
model = PPO.load(load_model_path, env=env)

Wrapping the env in a VecTransposeImage.


We evaluate the model with two measures:
 - The mean reward
 - The standard deviation

A *high mean reward* and a *low standard deviation* indicate a **high-performing, stable** policy,

whereas a policy with a *low mean reward* and a *high standard deviation* is considered **poorly performing and unstable**.
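
As a rough illustration of these two measures, the sketch below computes the mean and the standard deviation from a list of per-episode rewards. The reward values are made up for the example, and `summarize` is a hypothetical helper. To our knowledge `evaluate_policy` reports the population standard deviation (NumPy's default), so `statistics.pstdev` is used here.

```python
import statistics


def summarize(episode_rewards):
    """Return (mean, population standard deviation) of the episode rewards."""
    mean_reward = statistics.fmean(episode_rewards)
    std_reward = statistics.pstdev(episode_rewards)
    return mean_reward, std_reward


# Made-up rewards for two hypothetical policies with the same mean
stable = [260.0, 270.0, 280.0, 265.0, 275.0]
unstable = [100.0, 450.0, 300.0, 50.0, 450.0]

for name, rewards in [("stable", stable), ("unstable", unstable)]:
    mean_reward, std_reward = summarize(rewards)
    print(f"{name}: mean={mean_reward:.1f} std={std_reward:.1f}")
```

Both lists average to the same reward, so only the standard deviation separates them. This is why the mean alone is not enough to compare policies.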

In [9]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, render=True)

Track generation: 1342..1685 -> 343-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1119..1406 -> 287-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1289..1615 -> 326-tiles track




Track generation: 1223..1533 -> 310-tiles track
Track generation: 1164..1459 -> 295-tiles track
Track generation: 1023..1290 -> 267-tiles track
Track generation: 1057..1325 -> 268-tiles track
Track generation: 1088..1364 -> 276-tiles track


In [10]:
print('Mean reward:', mean_reward, '||| Standard deviation:', std_reward)

Mean reward: 270.1245722711086 ||| Standard deviation: 118.42249644089515


## Optuna

Searching for better hyperparameters with Optuna:

In [None]:
study_name = "study_200k"
timesteps = 200000

In [11]:
def objective(trial):

    # Define hyperparameters
    learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-2)
    gamma = trial.suggest_float('gamma', 0.9, 0.99)
    clip_range = trial.suggest_float('clip_range', 0.1, 0.4)
    ent_coef = trial.suggest_float('ent_coef', 1e-4, 1e-3)

    environment_name = 'CarRacing-v0'
    env = gym.make(environment_name)
    env = DummyVecEnv([lambda: env])

    model = PPO("CnnPolicy", env, verbose=0, 
                learning_rate=learning_rate, gamma=gamma,
                clip_range=clip_range, ent_coef=ent_coef, device="mps")

    # Train model
    model.learn(total_timesteps=timesteps)

    # Save model
    model_path = f'models/{study_name}/PPO_{learning_rate}_{gamma}__{clip_range}_{ent_coef}__model'
    model.save(model_path)

    # Load model
    model = PPO.load(model_path, env=env)

    # Evaluate model
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5)

    # Close environment
    env.close()

    # Return mean reward to maximize
    return mean_reward
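
The learning-rate range above spans two orders of magnitude, and `suggest_float` samples it uniformly by default, so most draws land above 1e-3. Optuna's `suggest_float` accepts `log=True` to sample on a log scale instead. The sketch below uses only the standard library to compare the two sampling schemes; the helper names, the sample count and the 1e-3 threshold are arbitrary choices for illustration.

```python
import math
import random


def sample_uniform(rng, low, high):
    """Draw uniformly between low and high."""
    return rng.uniform(low, high)


def sample_log_uniform(rng, low, high):
    """Draw uniformly in log space, then map back to the original scale."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the comparison is repeatable
n = 10_000
low, high, threshold = 1e-4, 1e-2, 1e-3

uniform_small = sum(sample_uniform(rng, low, high) < threshold for _ in range(n)) / n
log_small = sum(sample_log_uniform(rng, low, high) < threshold for _ in range(n)) / n

print(f"uniform:     {uniform_small:.1%} of samples below {threshold}")
print(f"log-uniform: {log_small:.1%} of samples below {threshold}")
```

In expectation, roughly 9% of the uniform draws fall below 1e-3, against about half of the log-uniform draws. A log scale therefore explores small learning rates far more often for the same number of trials.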

In [13]:
study = optuna.create_study(storage="sqlite:///optunaRLCRstudy.db", study_name=study_name, direction='maximize')

study.optimize(objective, n_trials=10)

print(f"Meilleur score: {study.best_value}")
print(f"Meilleur hyperparameters: {study.best_params}")

[I 2023-03-26 21:56:33,293] A new study created in RDB with name: study__200k


Track generation: 1106..1394 -> 288-tiles track
Track generation: 1211..1518 -> 307-tiles track
Track generation: 1175..1472 -> 297-tiles track
Track generation: 1039..1308 -> 269-tiles track
Track generation: 1100..1379 -> 279-tiles track
Track generation: 1038..1308 -> 270-tiles track
Track generation: 1141..1436 -> 295-tiles track
Track generation: 1144..1434 -> 290-tiles track
Track generation: 1146..1436 -> 290-tiles track
Track generation: 1291..1618 -> 327-tiles track
Track generation: 1080..1358 -> 278-tiles track
Track generation: 1237..1557 -> 320-tiles track
Track generation: 1300..1629 -> 329-tiles track
Track generation: 1125..1410 -> 285-tiles track
Track generation: 1187..1488 -> 301-tiles track
Track generation: 1215..1523 -> 308-tiles track
Track generation: 962..1214 -> 252-tiles track
Track generation: 1336..1674 -> 338-tiles track
Track generation: 1256..1573 -> 317-tiles track
Track generation: 1212..1529 -> 317-tiles track
retry to generate track (normal if there 

[I 2023-03-26 22:31:50,406] Trial 0 finished with value: 214.95445947945117 and parameters: {'learning_rate': 0.00039317898096148755, 'gamma': 0.9183776997781369, 'clip_range': 0.21404777037661815, 'ent_coef': 0.00093977018380121}. Best is trial 0 with value: 214.95445947945117.


Track generation: 1065..1340 -> 275-tiles track
Track generation: 1269..1590 -> 321-tiles track
Track generation: 1241..1555 -> 314-tiles track
Track generation: 1218..1536 -> 318-tiles track
Track generation: 1291..1618 -> 327-tiles track
Track generation: 1214..1522 -> 308-tiles track
Track generation: 1166..1470 -> 304-tiles track
Track generation: 1093..1374 -> 281-tiles track
Track generation: 1119..1407 -> 288-tiles track
Track generation: 1316..1649 -> 333-tiles track
Track generation: 1199..1503 -> 304-tiles track
Track generation: 1046..1311 -> 265-tiles track
Track generation: 1093..1377 -> 284-tiles track
Track generation: 960..1211 -> 251-tiles track
Track generation: 1249..1564 -> 315-tiles track
Track generation: 1014..1275 -> 261-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1072..1344 -> 272-tiles track
Track generation: 1083..1358 -> 275-tiles track
Track generation: 1059..1328 -> 269-tiles track
Track gen

[I 2023-03-26 23:07:18,651] Trial 1 finished with value: -83.40715942084789 and parameters: {'learning_rate': 0.003586300401704068, 'gamma': 0.94676963997145, 'clip_range': 0.2026895856704498, 'ent_coef': 0.0004117155937922256}. Best is trial 0 with value: 214.95445947945117.


Track generation: 1216..1524 -> 308-tiles track
Track generation: 1171..1468 -> 297-tiles track
Track generation: 1176..1473 -> 297-tiles track
Track generation: 1216..1524 -> 308-tiles track
Track generation: 1136..1424 -> 288-tiles track
Track generation: 1142..1434 -> 292-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1357..1701 -> 344-tiles track
Track generation: 1188..1489 -> 301-tiles track
Track generation: 1180..1479 -> 299-tiles track
Track generation: 1171..1468 -> 297-tiles track
Track generation: 1031..1298 -> 267-tiles track
Track generation: 1132..1419 -> 287-tiles track
Track generation: 1216..1524 -> 308-tiles track
Track generation: 1249..1560 -> 311-tiles track
Track generation: 1195..1498 -> 303-tiles track
Track generation: 1040..1304 -> 264-tiles track
Track generation: 1200..1504 -> 304-tiles track
Track generation: 1292..1620 -> 328-tiles track
Track generation: 1036..1304 -> 268-tiles track
Track ge

[I 2023-03-26 23:42:38,421] Trial 2 finished with value: -28.976403856277464 and parameters: {'learning_rate': 0.0007865703351535196, 'gamma': 0.9423846454364294, 'clip_range': 0.2660542470424372, 'ent_coef': 0.00040769685854966065}. Best is trial 0 with value: 214.95445947945117.


Track generation: 1115..1398 -> 283-tiles track
Track generation: 1215..1523 -> 308-tiles track
Track generation: 1108..1389 -> 281-tiles track
Track generation: 1212..1519 -> 307-tiles track
Track generation: 1251..1568 -> 317-tiles track
Track generation: 1096..1374 -> 278-tiles track
Track generation: 1111..1395 -> 284-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1200..1504 -> 304-tiles track
Track generation: 1269..1591 -> 322-tiles track
Track generation: 1294..1630 -> 336-tiles track
Track generation: 1036..1299 -> 263-tiles track
Track generation: 1180..1479 -> 299-tiles track
Track generation: 1118..1402 -> 284-tiles track
Track generation: 1077..1353 -> 276-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1077..1350 -> 273-tiles track
Track generation: 1103..1383 -> 280-tiles track
Track generation: 1128..1414 -> 286-tiles track
Track generation: 1123..

[I 2023-03-27 00:17:52,802] Trial 3 finished with value: -63.87538165748119 and parameters: {'learning_rate': 0.008017769831211642, 'gamma': 0.9845737033260175, 'clip_range': 0.106339453771881, 'ent_coef': 0.00046629858224460504}. Best is trial 0 with value: 214.95445947945117.


Track generation: 1211..1526 -> 315-tiles track
Track generation: 1011..1268 -> 257-tiles track
Track generation: 962..1212 -> 250-tiles track
Track generation: 1224..1587 -> 363-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1107..1388 -> 281-tiles track
Track generation: 1161..1455 -> 294-tiles track
Track generation: 1114..1397 -> 283-tiles track
Track generation: 1224..1534 -> 310-tiles track
Track generation: 1191..1493 -> 302-tiles track
Track generation: 1244..1559 -> 315-tiles track
Track generation: 1021..1280 -> 259-tiles track
Track generation: 1088..1364 -> 276-tiles track
Track generation: 1099..1378 -> 279-tiles track
Track generation: 1068..1339 -> 271-tiles track
Track generation: 1152..1444 -> 292-tiles track
Track generation: 1116..1401 -> 285-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1097..1375 -> 278-tiles track
Track generation: 1336..1

[I 2023-03-27 00:53:13,943] Trial 4 finished with value: -93.14742199033499 and parameters: {'learning_rate': 0.005236764205578114, 'gamma': 0.9852583991058544, 'clip_range': 0.3909804801774728, 'ent_coef': 0.0008823998499287554}. Best is trial 0 with value: 214.95445947945117.


Track generation: 1201..1505 -> 304-tiles track
Track generation: 1257..1575 -> 318-tiles track
Track generation: 1256..1574 -> 318-tiles track
Track generation: 1164..1459 -> 295-tiles track
Track generation: 1047..1313 -> 266-tiles track
Track generation: 1332..1669 -> 337-tiles track
Track generation: 1271..1615 -> 344-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1072..1344 -> 272-tiles track
Track generation: 1157..1451 -> 294-tiles track
Track generation: 1058..1327 -> 269-tiles track
Track generation: 1033..1295 -> 262-tiles track
Track generation: 1004..1259 -> 255-tiles track
Track generation: 1219..1528 -> 309-tiles track
Track generation: 1237..1550 -> 313-tiles track
Track generation: 1050..1316 -> 266-tiles track
Track generation: 1240..1554 -> 314-tiles track
Track generation: 1124..1409 -> 285-tiles track
Track generation: 1174..1471 -> 297-tiles track
Track generation: 1192..1494 -> 302-tiles track
Track ge

[I 2023-03-27 01:28:34,962] Trial 5 finished with value: -43.558276088535784 and parameters: {'learning_rate': 0.00528242433671561, 'gamma': 0.9108121448931884, 'clip_range': 0.26932786080287485, 'ent_coef': 0.00011312388350375221}. Best is trial 0 with value: 214.95445947945117.


Track generation: 1267..1588 -> 321-tiles track
Track generation: 1244..1568 -> 324-tiles track
Track generation: 1196..1499 -> 303-tiles track
Track generation: 1187..1488 -> 301-tiles track
Track generation: 1136..1424 -> 288-tiles track
Track generation: 918..1156 -> 238-tiles track
Track generation: 1178..1476 -> 298-tiles track
Track generation: 1246..1561 -> 315-tiles track
Track generation: 1223..1532 -> 309-tiles track
Track generation: 1089..1369 -> 280-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1240..1554 -> 314-tiles track
Track generation: 1212..1519 -> 307-tiles track
Track generation: 1032..1294 -> 262-tiles track
Track generation: 1028..1288 -> 260-tiles track
Track generation: 1256..1574 -> 318-tiles track
Track generation: 1140..1429 -> 289-tiles track
Track generation: 1110..1396 -> 286-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1124..1

[I 2023-03-27 02:04:04,098] Trial 6 finished with value: -77.43237508237362 and parameters: {'learning_rate': 0.004710801215057911, 'gamma': 0.9468362740097578, 'clip_range': 0.3301917261349552, 'ent_coef': 0.00016662726308886702}. Best is trial 0 with value: 214.95445947945117.


Track generation: 1136..1423 -> 287-tiles track
Track generation: 1115..1398 -> 283-tiles track
Track generation: 1124..1414 -> 290-tiles track
Track generation: 1160..1453 -> 293-tiles track
Track generation: 1140..1429 -> 289-tiles track
Track generation: 1190..1492 -> 302-tiles track
Track generation: 1091..1368 -> 277-tiles track
Track generation: 1093..1371 -> 278-tiles track
Track generation: 1025..1293 -> 268-tiles track
Track generation: 1072..1344 -> 272-tiles track
Track generation: 1087..1363 -> 276-tiles track
Track generation: 1129..1423 -> 294-tiles track
Track generation: 1235..1548 -> 313-tiles track
Track generation: 1246..1565 -> 319-tiles track
Track generation: 1231..1543 -> 312-tiles track
Track generation: 1071..1343 -> 272-tiles track
Track generation: 1129..1416 -> 287-tiles track
Track generation: 1328..1664 -> 336-tiles track
Track generation: 1191..1493 -> 302-tiles track
Track generation: 1086..1361 -> 275-tiles track
Track generation: 1233..1553 -> 320-tile

[I 2023-03-27 02:39:12,026] Trial 7 finished with value: -81.57178148776293 and parameters: {'learning_rate': 0.009478483100357586, 'gamma': 0.9888393195375208, 'clip_range': 0.17900031834672514, 'ent_coef': 0.00040865691008535183}. Best is trial 0 with value: 214.95445947945117.


Track generation: 957..1203 -> 246-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1099..1378 -> 279-tiles track
Track generation: 1096..1374 -> 278-tiles track
Track generation: 1403..1758 -> 355-tiles track
Track generation: 979..1236 -> 257-tiles track
Track generation: 1048..1314 -> 266-tiles track
Track generation: 1173..1471 -> 298-tiles track
Track generation: 1256..1574 -> 318-tiles track
Track generation: 1279..1603 -> 324-tiles track
Track generation: 1164..1466 -> 302-tiles track
Track generation: 1239..1553 -> 314-tiles track
Track generation: 1076..1349 -> 273-tiles track
Track generation: 1011..1268 -> 257-tiles track
Track generation: 1251..1568 -> 317-tiles track
Track generation: 1100..1379 -> 279-tiles track
Track generation: 1202..1517 -> 315-tiles track
Track generation: 1131..1417 -> 286-tiles track
Track generation: 1114..1402 -> 288-tiles track
Track generation: 1130..1417 -> 287-tiles track
Track gene

[I 2023-03-27 03:14:19,298] Trial 8 finished with value: -78.32718561589718 and parameters: {'learning_rate': 0.0008635505048026578, 'gamma': 0.952350965064394, 'clip_range': 0.34604871041058566, 'ent_coef': 0.0003751706057335957}. Best is trial 0 with value: 214.95445947945117.


Track generation: 1110..1392 -> 282-tiles track
Track generation: 1267..1588 -> 321-tiles track
Track generation: 1224..1541 -> 317-tiles track
Track generation: 1257..1582 -> 325-tiles track
Track generation: 1044..1309 -> 265-tiles track
Track generation: 1200..1504 -> 304-tiles track
Track generation: 1188..1489 -> 301-tiles track
Track generation: 1178..1476 -> 298-tiles track
Track generation: 1104..1392 -> 288-tiles track
Track generation: 1116..1399 -> 283-tiles track
Track generation: 946..1195 -> 249-tiles track
Track generation: 1102..1382 -> 280-tiles track
Track generation: 1163..1463 -> 300-tiles track
Track generation: 1104..1384 -> 280-tiles track
Track generation: 1084..1359 -> 275-tiles track
Track generation: 1307..1638 -> 331-tiles track
Track generation: 1176..1474 -> 298-tiles track
Track generation: 1244..1559 -> 315-tiles track
Track generation: 1223..1533 -> 310-tiles track
Track generation: 1144..1434 -> 290-tiles track
Track generation: 1099..1378 -> 279-tiles

[I 2023-03-27 03:49:28,790] Trial 9 finished with value: -93.14844299405813 and parameters: {'learning_rate': 0.0063672542735348685, 'gamma': 0.9805736094377064, 'clip_range': 0.3320416995195945, 'ent_coef': 0.0006979690480648206}. Best is trial 0 with value: 214.95445947945117.


Track generation: 1108..1389 -> 281-tiles track
Meilleur score: 214.95445947945117
Meilleur hyperparameters: {'clip_range': 0.21404777037661815, 'ent_coef': 0.00093977018380121, 'gamma': 0.9183776997781369, 'learning_rate': 0.00039317898096148755}
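
To compare the trials at a glance, the values reported in the log above can be collected and sorted. The sketch below hard-codes the learning rates and mean rewards copied (and rounded) from that log. With a live study, `study.trials_dataframe()` should give the same information directly.

```python
# (trial number, learning_rate, mean reward), rounded from the Optuna log above
trials = [
    (0, 0.000393, 214.95),
    (1, 0.003586, -83.41),
    (2, 0.000787, -28.98),
    (3, 0.008018, -63.88),
    (4, 0.005237, -93.15),
    (5, 0.005282, -43.56),
    (6, 0.004711, -77.43),
    (7, 0.009478, -81.57),
    (8, 0.000864, -78.33),
    (9, 0.006367, -93.15),
]

# Rank trials from best to worst mean reward
ranked = sorted(trials, key=lambda t: t[2], reverse=True)
best = ranked[0]
positive = [t for t in trials if t[2] > 0]

for number, lr, reward in ranked:
    print(f"trial {number}: learning_rate={lr:.6f} reward={reward:8.2f}")
print(f"{len(positive)} of {len(trials)} trials reached a positive mean reward")
```

Only trial 0 reaches a positive mean reward in this run. With ten trials and five evaluation episodes each, that is too little data to draw firm conclusions about any single hyperparameter.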


Loading the best model to test it

In [None]:
load_model_path = 'models/PPO_0.0018136080950905144_0.9642267575906099__0.11740309789358286_0.007042102315082725__model'
model = PPO.load(load_model_path, env=env)

In [None]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, render=True)

Creating a final model with 1M timesteps

In [14]:
environment_name = 'CarRacing-v0'
env = gym.make(environment_name)
env = DummyVecEnv([lambda: env])

model = PPO("CnnPolicy", env, verbose=1, device='mps')

timesteps = 1000000
model.learn(total_timesteps=timesteps)

model_path = 'models/PPO_1M_base__model'
model.save(model_path)

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, render=True)
print('Mean reward:', mean_reward, '||| Standard deviation:', std_reward)

Using mps device
Wrapping the env in a VecTransposeImage.
Track generation: 1013..1275 -> 262-tiles track
Track generation: 931..1168 -> 237-tiles track
Track generation: 1101..1386 -> 285-tiles track
-----------------------------
| time/              |      |
|    fps             | 158  |
|    iterations      | 1    |
|    time_elapsed    | 12   |
|    total_timesteps | 2048 |
-----------------------------
Track generation: 1191..1501 -> 310-tiles track
Track generation: 1179..1478 -> 299-tiles track
-----------------------------------------
| time/                   |             |
|    fps                  | 120         |
|    iterations           | 2           |
|    time_elapsed         | 33          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008287225 |
|    clip_fraction        | 0.114       |
|    clip_range           | 0.2         |
|    entropy_loss         | -4.25       |
|    explained_variance   | 0.0