# Environment
A new track is generated at random for every episode.

* Observation space: a 96x96 RGB image (96x96x3 pixel values).
* Action space: If continuous, there are 3 actions: steering (-1 is full left, +1 is full right), gas, and braking. If discrete, there are 5 actions: do nothing, steer left, steer right, gas, and brake.
* Rewards: The reward is -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. For example, if you visit every tile and finish in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points.
* Episode termination: The episode finishes when all of the tiles are visited. The car can also leave the playfield (that is, go far off the track), in which case it receives a reward of -100 and dies.
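
The reward rule in the list above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not the environment's own code: the function name `episode_reward`, its parameters, and the tile count of 300 are hypothetical and chosen for illustration.

```python
def episode_reward(frames, tiles_visited, total_tiles, left_playfield=False):
    """Score an episode: -0.1 per frame, +1000/N per tile visited,
    and an extra -100 if the car left the playfield."""
    reward = -0.1 * frames + (1000.0 / total_tiles) * tiles_visited
    if left_playfield:
        reward -= 100.0
    return reward


# Hypothetical track with N = 300 tiles, all visited in 732 frames.
full_lap = episode_reward(frames=732, tiles_visited=300, total_tiles=300)
print(round(full_lap, 1))  # 926.8, matching the example in the list

# Same track, but only half the tiles visited before leaving the playfield.
crash = episode_reward(frames=400, tiles_visited=150, total_tiles=300,
                       left_playfield=True)
print(round(crash, 1))
```

Because the tile bonus is scaled by 1/N, a complete lap always contributes 1000 points regardless of track length, so the final score depends only on how many frames the lap took.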

In [1]:
import gym
from stable_baselines3.ppo import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.ppo.policies import MlpPolicy
from stable_baselines3.common.evaluation import evaluate_policy
import numpy as np

In [11]:
# Get the environment
env = gym.make('CarRacing-v0')

# Wrap with monitor
env = Monitor(env)

In [12]:
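# Get the model: PPO with an MLP policy (the image observation is flattened).
# device='cuda' assumes a GPU is available.
model = PPO(MlpPolicy, env, verbose=0, device='cuda')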

In [13]:
# Define the evaluation function
def eval_model(mdl, env_in, num_eval=50):
    """Run num_eval evaluation episodes and print reward and length statistics."""
    # return_episode_rewards=True returns per-episode lists instead of (mean, std)
    eps_rewards, eps_lengths = evaluate_policy(
        mdl, env=env_in, n_eval_episodes=num_eval, return_episode_rewards=True
    )
    print('Mean reward', np.mean(eps_rewards))
    print('Std reward', np.std(eps_rewards))
    print('Mean episode length', np.mean(eps_lengths))

In [14]:
# Train the model for 100,000 timesteps (total_timesteps is expected to be an int)
model.learn(total_timesteps=int(1e5))

<stable_baselines3.ppo.ppo.PPO at 0x7f80b8ff9f90>

In [15]:
# Evaluate the model
eval_model(model, env)

Mean reward 99.65782004
Std reward 132.8614795805479
Mean episode length 1000.0
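
As a rough reading of these numbers, the reward rule from the Environment section can be inverted to estimate how much of the track the agent covers on average. The sketch below assumes that no evaluation episode incurred the -100 off-playfield penalty; this is consistent with every episode running for the same 1000 steps, but the output does not confirm it. The helper name `implied_tile_fraction` is hypothetical.

```python
def implied_tile_fraction(mean_reward, mean_length,
                          frame_penalty=0.1, tile_budget=1000.0):
    """Invert reward = tile_budget * fraction - frame_penalty * length.

    Assumes no -100 off-playfield penalty was applied in any episode.
    """
    return (mean_reward + frame_penalty * mean_length) / tile_budget


# Values taken from the evaluation output above.
frac = implied_tile_fraction(mean_reward=99.65782004, mean_length=1000.0)
print(f"Estimated share of tiles visited: {frac:.1%}")
```

Under that assumption, the agent visits roughly a fifth of the track tiles per episode, so a mean reward near 100 is still far from a completed lap.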
