Import

In [5]:
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
import optuna

Creating the environment

In [6]:
environment_name = 'CarRacing-v0'
env = gym.make(environment_name)
env.seed(42)
print('Action space :', env.action_space, '||| Observation space (shape) :', env.observation_space.shape)

Action space : Box([-1.  0.  0.], [1. 1. 1.], (3,), float32) ||| Observation space (shape) : (96, 96, 3)


First test with a random agent

In [7]:
episodes = 2
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+=reward
    print(f'Episode:{episode} Score:{score}')
env.close()

Track generation: 1208..1514 -> 306-tiles track
Episode:1 Score:-34.42622950819724
Track generation: 1203..1508 -> 305-tiles track
Episode:2 Score:-34.2105263157899
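For intuition about what the random agent does at each step, here is a minimal standard-library sketch of uniform sampling inside the printed action space `Box([-1, 0, 0], [1, 1, 1])`. It is an illustration, not gym's implementation: `sample_action`, `LOW`, `HIGH` and the seed are hypothetical names and values chosen for this example, and the reading of the three components as steering, gas and brake comes from the CarRacing environment description rather than from this notebook.

```python
import random

# Bounds copied from the printed action space: Box([-1, 0, 0], [1, 1, 1]).
LOW = (-1.0, 0.0, 0.0)
HIGH = (1.0, 1.0, 1.0)


def sample_action(rng):
    """Draw one action uniformly within the per-dimension bounds (hypothetical helper)."""
    return [rng.uniform(lo, hi) for lo, hi in zip(LOW, HIGH)]


rng = random.Random(0)  # arbitrary seed, only to make the sketch repeatable
actions = [sample_action(rng) for _ in range(1000)]

# Assumed meaning of the three components: steering, gas, brake.
steering, gas, brake = actions[0]
print(f"steering={steering:.2f} gas={gas:.2f} brake={brake:.2f}")
```

Because the three components are drawn independently, such an agent regularly accelerates and brakes at the same time, which is one plausible reason its score stays negative.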


### Creating our first test model:

Create the DummyVecEnv environment to train on.

`env.seed(42)` is added so that the same tracks are generated on every run.

In [9]:
# Build the gym env inside the factory so the lambda does not rely on the `env` name being rebound
env = DummyVecEnv([lambda: gym.make(environment_name)])
env.seed(42)

[[42]]
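The effect of a fixed seed can be illustrated without gym. The sketch below is an analogy only: `track_lengths` is a hypothetical stand-in for track generation that draws tile counts from a seeded pseudo-random generator, and the range 230..340 is an arbitrary choice for the example.

```python
import random


def track_lengths(seed, n_tracks=3):
    """Hypothetical stand-in for track generation: draw n_tracks tile counts."""
    rng = random.Random(seed)
    return [rng.randint(230, 340) for _ in range(n_tracks)]


first_run = track_lengths(42)
second_run = track_lengths(42)
other_seed = track_lengths(7)

# Same seed, same sequence; a different seed is free to produce another one.
print(first_run == second_run)
print(first_run, other_seed)
```

The same idea explains why the training logs in this notebook start with the same sequence of tracks (306, 305, 286 tiles, ...) whenever the environment has been seeded with 42.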

Declaring the PPO model with Stable Baselines3

In [10]:
model = PPO("CnnPolicy", env, verbose=1, device='mps')

Using mps device
Wrapping the env in a VecTransposeImage.


In [11]:
timesteps = 200000
model.learn(total_timesteps=timesteps)

Track generation: 1208..1514 -> 306-tiles track
Track generation: 1203..1508 -> 305-tiles track
Track generation: 1127..1413 -> 286-tiles track
-----------------------------
| time/              |      |
|    fps             | 134  |
|    iterations      | 1    |
|    time_elapsed    | 15   |
|    total_timesteps | 2048 |
-----------------------------
Track generation: 1237..1550 -> 313-tiles track
Track generation: 1129..1416 -> 287-tiles track
-----------------------------------------
| time/                   |             |
|    fps                  | 106         |
|    iterations           | 2           |
|    time_elapsed         | 38          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.010080028 |
|    clip_fraction        | 0.088       |
|    clip_range           | 0.2         |
|    entropy_loss         | -4.26       |
|    explained_variance   | -0.00309    |
|    learning_rate        | 0.0003      |
|   

<stable_baselines3.ppo.ppo.PPO at 0x2bcc4cd00>

We save the model under a name that records its number of timesteps and its specifics

In [12]:
model_path = 'models/PPO_200k_base_seed42__model'
model.save(model_path)

We load the model (useful, for example, to pick the best of several saved models)

In [13]:
load_model_path = 'models/PPO_200k_base_seed42__model'
model = PPO.load(load_model_path, env=env)

Wrapping the env in a VecTransposeImage.


We evaluate the model with 2 metrics:
 - The mean reward
 - The standard deviation

A *high mean reward* and a *low standard deviation* indicate a **performant and stable** policy,

whereas a policy with a *low mean reward* and a *high standard deviation* is considered **poorly performing and unstable**.
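To make the two metrics concrete, here is a small sketch that computes them from a list of per-episode rewards. The reward lists are made-up values for illustration, `summarize` is a hypothetical helper, and the use of the population standard deviation is an assumption about how such a summary is usually computed, not something this notebook states.

```python
import statistics


def summarize(episode_rewards):
    """Return (mean, population standard deviation) of per-episode rewards."""
    mean = statistics.fmean(episode_rewards)
    std = statistics.pstdev(episode_rewards)
    return mean, std


# Hypothetical rewards over 5 evaluation episodes.
stable_policy = [800.0, 820.0, 790.0, 810.0, 805.0]
unstable_policy = [900.0, 100.0, 850.0, -50.0, 600.0]

for name, rewards in [("stable", stable_policy), ("unstable", unstable_policy)]:
    mean, std = summarize(rewards)
    print(f"{name}: mean={mean:.1f} std={std:.1f}")
```

With only 5 episodes the standard deviation is itself a noisy estimate, so comparisons between models that differ by less than one standard deviation should be read with caution.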

In [14]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, render=True)

Track generation: 899..1133 -> 234-tiles track




Track generation: 1040..1304 -> 264-tiles track
Track generation: 1179..1478 -> 299-tiles track
Track generation: 1039..1305 -> 266-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1160..1454 -> 294-tiles track
Track generation: 1074..1347 -> 273-tiles track
Track generation: 1229..1546 -> 317-tiles track


In [15]:
print('Mean reward :', mean_reward, '||| Standard deviation :', std_reward)

Mean reward : 569.3619286179543 ||| Standard deviation : 269.2836495382977


## Optuna

Hyperparameter search with Optuna:

(Latest update: the environment is also kept identical across trials with `env.seed(42)`.)

In [16]:
study_name = "study_200k_seed42"
timesteps = 200000

In [17]:
def objective(trial):

    # Define hyperparameters
    learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-2)
    gamma = trial.suggest_float('gamma', 0.9, 0.99)
    clip_range = trial.suggest_float('clip_range', 0.1, 0.4)
    ent_coef = trial.suggest_float('ent_coef', 1e-4, 1e-3)

    # Build the environment inside the factory and seed it so every trial sees the same tracks
    environment_name = 'CarRacing-v0'
    env = DummyVecEnv([lambda: gym.make(environment_name)])
    env.seed(42)

    model = PPO("CnnPolicy", env, verbose=0, 
                learning_rate=learning_rate, gamma=gamma,
                clip_range=clip_range, ent_coef=ent_coef, device="mps")

    # Train model
    model.learn(total_timesteps=timesteps)

    # Save model
    model_path = f'models/{study_name}/PPO_{learning_rate}_{gamma}__{clip_range}_{ent_coef}__model'
    model.save(model_path)

    # Load model
    model = PPO.load(model_path, env=env)

    # Evaluate model
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5)

    # Close environment
    env.close()

    # Return mean reward to maximize
    return mean_reward
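The `learning_rate` range in the objective above spans two orders of magnitude (1e-4 to 1e-2). The sketch below uses only the standard library and plain random draws (it is not Optuna's sampler) to compare how often uniform and log-uniform sampling land below 1e-3. The helper names, the seed and the sample count are choices made for this illustration.

```python
import math
import random

LOW, HIGH = 1e-4, 1e-2  # learning_rate bounds used in the objective


def sample_uniform(rng):
    """Draw uniformly on the linear scale."""
    return rng.uniform(LOW, HIGH)


def sample_log_uniform(rng):
    """Draw uniformly on the log scale, then map back."""
    return math.exp(rng.uniform(math.log(LOW), math.log(HIGH)))


rng = random.Random(0)  # arbitrary seed for repeatability
n = 100_000
frac_uniform = sum(sample_uniform(rng) < 1e-3 for _ in range(n)) / n
frac_log = sum(sample_log_uniform(rng) < 1e-3 for _ in range(n)) / n

# Analytically: (1e-3 - 1e-4) / (1e-2 - 1e-4) is about 0.09 for uniform, and 0.5 for log-uniform.
print(f"uniform: {frac_uniform:.3f}  log-uniform: {frac_log:.3f}")
```

Among the four trials whose results are logged below, the only one with a positive score used a learning rate of about 4.4e-4, while the three with values above 6e-3 scored below zero. Optuna's `trial.suggest_float` accepts `log=True` to sample on a log scale; whether that would improve this search is an assumption that would need more trials to confirm.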

In [18]:
study = optuna.create_study(storage="sqlite:///optunaRLCRstudy.db", study_name=study_name, direction='maximize')

study.optimize(objective, n_trials=5)

print(f"Best score: {study.best_value}")
print(f"Best hyperparameters: {study.best_params}")

[I 2023-03-28 11:28:17,083] A new study created in RDB with name: study_200k_seed42


Track generation: 1208..1514 -> 306-tiles track
Track generation: 1203..1508 -> 305-tiles track
Track generation: 1127..1413 -> 286-tiles track
Track generation: 1237..1550 -> 313-tiles track
Track generation: 1129..1416 -> 287-tiles track
Track generation: 1177..1475 -> 298-tiles track
Track generation: 1048..1318 -> 270-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1236..1549 -> 313-tiles track
Track generation: 1243..1558 -> 315-tiles track
Track generation: 1131..1418 -> 287-tiles track
Track generation: 1157..1451 -> 294-tiles track
Track generation: 1147..1438 -> 291-tiles track
Track generation: 1037..1307 -> 270-tiles track
Track generation: 1011..1271 -> 260-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1203..1508 -> 305-tiles track
Track generation: 1059..1328 -> 269-tiles track
Track generation: 1163..1458 -> 295-tiles track
Track generation: 963..1



Track generation: 899..1133 -> 234-tiles track
Track generation: 1040..1304 -> 264-tiles track
Track generation: 1179..1478 -> 299-tiles track
Track generation: 1039..1305 -> 266-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1160..1454 -> 294-tiles track
Track generation: 1074..1347 -> 273-tiles track


[I 2023-03-28 12:03:32,282] Trial 0 finished with value: 764.4983791157604 and parameters: {'learning_rate': 0.0004404439563695323, 'gamma': 0.9391774113519196, 'clip_range': 0.141118008074816, 'ent_coef': 0.0005216509661898132}. Best is trial 0 with value: 764.4983791157604.


Track generation: 1229..1546 -> 317-tiles track
Track generation: 1208..1514 -> 306-tiles track
Track generation: 1203..1508 -> 305-tiles track
Track generation: 1127..1413 -> 286-tiles track
Track generation: 1237..1550 -> 313-tiles track
Track generation: 1129..1416 -> 287-tiles track
Track generation: 1177..1475 -> 298-tiles track
Track generation: 1048..1318 -> 270-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1236..1549 -> 313-tiles track
Track generation: 1243..1558 -> 315-tiles track
Track generation: 1131..1418 -> 287-tiles track
Track generation: 1157..1451 -> 294-tiles track
Track generation: 1147..1438 -> 291-tiles track
Track generation: 1037..1307 -> 270-tiles track
Track generation: 1011..1271 -> 260-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1203..1508 -> 305-tiles track
Track generation: 1059..1328 -> 269-tiles track
Track generation: 1163..

[I 2023-03-28 12:37:30,406] Trial 1 finished with value: -92.64145351499319 and parameters: {'learning_rate': 0.006833748774518209, 'gamma': 0.96803210963415, 'clip_range': 0.14331628805385008, 'ent_coef': 0.0005407070298176535}. Best is trial 0 with value: 764.4983791157604.


Track generation: 1074..1347 -> 273-tiles track
Track generation: 1208..1514 -> 306-tiles track
Track generation: 1203..1508 -> 305-tiles track
Track generation: 1127..1413 -> 286-tiles track
Track generation: 1237..1550 -> 313-tiles track
Track generation: 1129..1416 -> 287-tiles track
Track generation: 1177..1475 -> 298-tiles track
Track generation: 1048..1318 -> 270-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1236..1549 -> 313-tiles track
Track generation: 1243..1558 -> 315-tiles track
Track generation: 1131..1418 -> 287-tiles track
Track generation: 1157..1451 -> 294-tiles track
Track generation: 1147..1438 -> 291-tiles track
Track generation: 1037..1307 -> 270-tiles track
Track generation: 1011..1271 -> 260-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1203..1508 -> 305-tiles track
Track generation: 1059..1328 -> 269-tiles track
Track generation: 1163..

[I 2023-03-28 13:11:39,404] Trial 2 finished with value: -80.92981071025133 and parameters: {'learning_rate': 0.006383973566712401, 'gamma': 0.9624859458903102, 'clip_range': 0.2763684851066367, 'ent_coef': 0.0005370609394924749}. Best is trial 0 with value: 764.4983791157604.


Track generation: 1226..1537 -> 311-tiles track
Track generation: 1208..1514 -> 306-tiles track
Track generation: 1203..1508 -> 305-tiles track
Track generation: 1127..1413 -> 286-tiles track
Track generation: 1237..1550 -> 313-tiles track
Track generation: 1129..1416 -> 287-tiles track
Track generation: 1177..1475 -> 298-tiles track
Track generation: 1048..1318 -> 270-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1236..1549 -> 313-tiles track
Track generation: 1243..1558 -> 315-tiles track
Track generation: 1131..1418 -> 287-tiles track
Track generation: 1157..1451 -> 294-tiles track
Track generation: 1147..1438 -> 291-tiles track
Track generation: 1037..1307 -> 270-tiles track
Track generation: 1011..1271 -> 260-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1203..1508 -> 305-tiles track
Track generation: 1059..1328 -> 269-tiles track
Track generation: 1163..

[I 2023-03-28 13:45:57,131] Trial 3 finished with value: -86.63355255872011 and parameters: {'learning_rate': 0.009811487641228668, 'gamma': 0.9434657976600677, 'clip_range': 0.1943594733142664, 'ent_coef': 0.0008595743572266645}. Best is trial 0 with value: 764.4983791157604.


Track generation: 1249..1564 -> 315-tiles track
Track generation: 1208..1514 -> 306-tiles track
Track generation: 1203..1508 -> 305-tiles track
Track generation: 1127..1413 -> 286-tiles track
Track generation: 1237..1550 -> 313-tiles track
Track generation: 1129..1416 -> 287-tiles track
Track generation: 1177..1475 -> 298-tiles track
Track generation: 1048..1318 -> 270-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1236..1549 -> 313-tiles track
Track generation: 1243..1558 -> 315-tiles track
Track generation: 1131..1418 -> 287-tiles track
Track generation: 1157..1451 -> 294-tiles track
Track generation: 1147..1438 -> 291-tiles track
Track generation: 1037..1307 -> 270-tiles track
Track generation: 1011..1271 -> 260-tiles track
retry to generate track (normal if there are not manyinstances of this message)
Track generation: 1203..1508 -> 305-tiles track
Track generation: 1059..1328 -> 269-tiles track
Track generation: 1163..

Loading best model to test

In [None]:
load_model_path = 'models/study_200k_seed42/PPO_0.0004404439563695323_0.9391774113519196__0.141118008074816_0.0005216509661898132__model'
model = PPO.load(load_model_path, env=env)
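Rather than copying the long file name by hand, the path can be rebuilt from the hyperparameters with the same naming pattern that the objective function uses when saving. The sketch below is one possible way to do it. `model_path_for` is a hypothetical helper, and the dictionary reproduces the parameters of trial 0 from the log above.

```python
study_name = "study_200k_seed42"

# Parameters of trial 0, copied from the Optuna log.
best_params = {
    "learning_rate": 0.0004404439563695323,
    "gamma": 0.9391774113519196,
    "clip_range": 0.141118008074816,
    "ent_coef": 0.0005216509661898132,
}


def model_path_for(study_name, params):
    """Rebuild the save path with the same pattern as in the objective (hypothetical helper)."""
    return (
        f"models/{study_name}/PPO_{params['learning_rate']}_{params['gamma']}"
        f"__{params['clip_range']}_{params['ent_coef']}__model"
    )


load_model_path = model_path_for(study_name, best_params)
print(load_model_path)
```

In the notebook, `study.best_params` could be passed in place of the hand-written dictionary, assuming the study object is still in memory.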

In [None]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, render=True)
print('Mean reward :', mean_reward, '||| Standard deviation :', std_reward)

Creating a baseline model with 1M timesteps

In [14]:
environment_name = 'CarRacing-v0'
# Build the gym env inside the factory so the lambda does not rely on the `env` name being rebound
env = DummyVecEnv([lambda: gym.make(environment_name)])

# Default PPO hyperparameters, no seed
model = PPO("CnnPolicy", env, verbose=1, device='mps')

timesteps = 1000000
model.learn(total_timesteps=timesteps)

model_path = 'models/PPO_1M_base__model'
model.save(model_path)

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, render=True)
print('Mean reward :', mean_reward, '||| Standard deviation :', std_reward)

Using mps device
Wrapping the env in a VecTransposeImage.
Track generation: 1013..1275 -> 262-tiles track
Track generation: 931..1168 -> 237-tiles track
Track generation: 1101..1386 -> 285-tiles track
-----------------------------
| time/              |      |
|    fps             | 158  |
|    iterations      | 1    |
|    time_elapsed    | 12   |
|    total_timesteps | 2048 |
-----------------------------
Track generation: 1191..1501 -> 310-tiles track
Track generation: 1179..1478 -> 299-tiles track
-----------------------------------------
| time/                   |             |
|    fps                  | 120         |
|    iterations           | 2           |
|    time_elapsed         | 33          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008287225 |
|    clip_fraction        | 0.114       |
|    clip_range           | 0.2         |
|    entropy_loss         | -4.25       |
|    explained_variance   | 0.0

Creating a model with 500k timesteps using the best hyperparameters found by Optuna

In [2]:
environment_name = 'CarRacing-v0'
# Build the gym env inside the factory so the lambda does not rely on the `env` name being rebound
env = DummyVecEnv([lambda: gym.make(environment_name)])

# Hyperparameters taken from the best Optuna trial
model = PPO("CnnPolicy", env, verbose=1, 
            clip_range=0.21404777037661815, ent_coef=0.00093977018380121, 
            gamma=0.9183776997781369, learning_rate=0.00039317898096148755, 
            device='mps')

timesteps = 500000
model.learn(total_timesteps=timesteps)

model_path = 'models/PPO_500k_best_optuna__model'
model.save(model_path)

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, render=True)
print('Mean reward :', mean_reward, '||| Standard deviation :', std_reward)

Using mps device
Wrapping the env in a VecTransposeImage.
Track generation: 1107..1388 -> 281-tiles track




Track generation: 1060..1329 -> 269-tiles track
Track generation: 1211..1518 -> 307-tiles track
-----------------------------
| time/              |      |
|    fps             | 126  |
|    iterations      | 1    |
|    time_elapsed    | 16   |
|    total_timesteps | 2048 |
-----------------------------
Track generation: 1241..1555 -> 314-tiles track
Track generation: 1131..1418 -> 287-tiles track
-----------------------------------------
| time/                   |             |
|    fps                  | 102         |
|    iterations           | 2           |
|    time_elapsed         | 39          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008780808 |
|    clip_fraction        | 0.103       |
|    clip_range           | 0.214       |
|    entropy_loss         | -4.27       |
|    explained_variance   | -0.0049     |
|    learning_rate        | 0.000393    |
|    loss                 | 0.111       |
|    n_upd

KeyboardInterrupt: 

In [4]:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5, render=True)
print('Mean reward :', mean_reward, '||| Standard deviation :', std_reward)

Track generation: 1298..1636 -> 338-tiles track
Track generation: 1180..1479 -> 299-tiles track
Track generation: 1135..1423 -> 288-tiles track
Track generation: 1327..1663 -> 336-tiles track
Track generation: 1107..1388 -> 281-tiles track
Track generation: 1179..1486 -> 307-tiles track
Mean reward : 253.76776015609502 ||| Standard deviation : 132.00501921149507
