Synchronous A2C.
A "manager" object creates several "worker" objects whose code runs sequentially in a single thread.
- Each worker has its own environment and its own copy of the network; the manager owns the target network.
- Each worker has its own episode counter and a function that generates a trajectory.
- The function keeps returning trajectories until the number of episodes set at worker creation has been completed. The step counter is kept in the worker object's attributes.
- After an n-step trajectory or an episode ends, the worker returns to the manager all the data needed to compute the loss.
- The manager calls the workers' functions in turn and processes the returned trajectories for as long as at least one worker returns something.
- On receiving a trajectory from a worker, the manager computes the loss, runs the backward pass through the worker's network, copies the worker's gradients into the target network, and updates the weights of its own (target) network. The gradients are then zeroed and the target network's weights are copied into the current worker's network.
- Whenever all workers are active, the manager skips one random worker for a pass and brings it back on the next pass. This keeps the workers from running in lockstep, since episodes usually have the same (maximum) length.
- The worker's network is evaluated after every new best episode, and a video and a checkpoint are saved. An episode counts as best when its reward exceeds the previous global best by more than 50 points.
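
The skip-and-revive rule from the list above can be sketched in isolation. This is a simplified illustration, not the notebook's manager: `skip_schedule` and its parameters are hypothetical names, and workers that run out of episodes are ignored.

```python
import random

def skip_schedule(num_workers, num_passes, seed=0):
    """Return, for each pass, the ids of the workers that get to run."""
    rng = random.Random(seed)
    alive = [True] * num_workers
    skipped = None  # worker paused on the previous pass, if any
    schedule = []
    for _ in range(num_passes):
        if all(alive):
            # Everyone is active: pause one random worker for this pass.
            skipped = rng.randrange(num_workers)
            alive[skipped] = False
        else:
            # Someone was paused on the previous pass: bring it back.
            alive[skipped] = True
        schedule.append([w for w in range(num_workers) if alive[w]])
    return schedule

for i, ids in enumerate(skip_schedule(num_workers=4, num_passes=4)):
    print(f"pass {i}: workers {ids}")
```

Under this rule a worker is paused on every other pass, so workers that started together drift apart by one trajectory at a time.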



Imports

In [131]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.distributions as distributions
from torch.utils.tensorboard import SummaryWriter

import numpy as np
import gymnasium as gym

from collections import deque
import cv2

import os

from gymnasium.wrappers.monitoring.video_recorder import VideoRecorder
from IPython.display import HTML
from base64 import b64encode
from pyvirtualdisplay import Display


Model class

In [132]:
class CarRacing_CNN_Model(nn.Module):   
            
    def __init__(self, fstack_size):
        super(CarRacing_CNN_Model, self).__init__()        
        
        self.gru_hidden_state = None 
        # Shared feature extractor.
        # The network receives grayscale frames of size 84x84.
        # The number of frames in the stack is fstack_size.
        # The stacked frames are treated as the input channels of the CNN.
        self.cnn = nn.Sequential(
            nn.BatchNorm2d(fstack_size),
            nn.Conv2d(fstack_size, 32, 4, stride=2, padding=1),  # input channels must match fstack_size
            nn.LeakyReLU(0.2), 
            nn.Conv2d(32, 32, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2), 
            nn.Conv2d(32, 32, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2), 
            nn.Conv2d(32, 32, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2), 
            nn.BatchNorm2d(32),
            nn.Flatten()          
        )
        
        cnn_out_size = int(np.prod(self.cnn(torch.zeros(1, fstack_size, 84, 84)).size()))
               
        # Initialize the feature extractor
        self.cnn.apply(self.init_cnn_weights)        
        
        self.gru = nn.GRU(cnn_out_size, 256, 1, batch_first=True)  # GRU instead of the LSTM from the original paper
        
        
        # Actor head (mean, in [-1, 1])
        self.mu = nn.Sequential(
            nn.Linear(256, 3),
            nn.Tanh()
        )
        # Actor head (standard deviation; Softplus keeps it positive)
        self.sigma = nn.Sequential(
            nn.Linear(256, 3),
            nn.Softplus()
        )

        # Critic head
        self.value = nn.Linear(256, 1)
        
        
        
            
                                  
    #-------------------------------------------------------------------------------------------------------------------------
    
    def init_cnn_weights(self, m):
        if isinstance(m, nn.Conv2d):
            nn.init.xavier_uniform_(m.weight, gain=nn.init.calculate_gain('leaky_relu'))
            nn.init.constant_(m.bias, 0.01)
                            
    #-------------------------------------------------------------------------------------------------------------------------
    
    def reset_gru(self):
        # Drop the stored hidden state so the next forward pass starts fresh
        self.gru_hidden_state = None
                            
    #-------------------------------------------------------------------------------------------------------------------------
             
    def forward(self, state):
        cnn_out=self.cnn(state) 

        if self.gru_hidden_state is None:
            gru_out, hs = self.gru(cnn_out)
        else:    
            gru_out, hs = self.gru(cnn_out, self.gru_hidden_state) 

        self.gru_hidden_state = hs.detach()
         
        mu = self.mu(gru_out)
        sigma = self.sigma(gru_out)
        value = self.value(gru_out)
        return mu, sigma, value
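
The model obtains `cnn_out_size` by pushing a dummy tensor through the CNN. The same number can be checked by hand with the standard convolution output-size formula, floor((n + 2p - k) / s) + 1. The sketch below is plain Python for illustration only; `conv_out` is a hypothetical helper, and the layer parameters are copied from the `nn.Conv2d` calls above.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of one convolution along a single dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding) of the four convolutions in the feature extractor
layers = [(4, 2, 1), (3, 2, 1), (3, 2, 1), (3, 2, 1)]

size = 84  # frames are 84x84
sizes = [size]
for kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)

flat_features = 32 * size * size  # 32 output channels, flattened
print(sizes, flat_features)  # [84, 42, 21, 11, 6] 1152
```

The result does not depend on `fstack_size`, because the frame stack only changes the number of input channels of the first convolution.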

Worker class

In [133]:
class Worker:
    def __init__(self, id, target, num_episodes, fstack_size, device):
        
        self.device = device
        self.model = CarRacing_CNN_Model(fstack_size).to(self.device)
        self.model.load_state_dict(target.state_dict())
        self.episodes_left_to_do = num_episodes        
        self.__is_alive = True
        self.episode_steps = []
        self.episode_reward = 0
        self.episode_rewards = []
        self.best_ep_reward = -10000        

        self.id = id
        self.framestack = deque(maxlen=fstack_size)
        self.steps_done = 0
        self.last_steps_rewards = deque(maxlen=100)
        
        self.env = gym.make("CarRacing-v2", render_mode="rgb_array", lap_complete_percent=0.95, domain_randomize=False)   
        print(f'Worker {self.id} is ready')
        self.new_episode()  
    
    #-------------------------------------------------------------------------------------------------------------------------
         
    def revive(self):  
        if self.episodes_left_to_do > 0 and not self.__is_alive:
            self.__is_alive = True
            return 1
        else:
            return 0
    
    #-------------------------------------------------------------------------------------------------------------------------
        
    def kill(self):  
        if self.__is_alive:
            self.__is_alive = False        
            return -1
        else:
            return 0
    
    #-------------------------------------------------------------------------------------------------------------------------
    
    def is_alive(self):
        return self.__is_alive
    
    #-------------------------------------------------------------------------------------------------------------------------
    
    def preprocess_state(self, image):
        image = image[:84, 6:90]  # crop the frame to 84x84
        image = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)  # drop the color
        image = image / 255.0  # normalize to [0, 1]
        return image  
        
    #-------------------------------------------------------------------------------------------------------------------------
    
    def preprocess(self, states):

        return torch.stack([torch.from_numpy(self.preprocess_state(image_data)) for image_data in states]).float()
                           
    #----------------------------------------------------------------------------------------------------------------------
        
    def predict (self, model, states, greedy = False):
        
        t_framestack= self.preprocess(states).to(self.device)        
        t_framestack = t_framestack.unsqueeze(0)  # add the batch dimension

        # Get the prediction from the network
        mu_t, sigma_t, value_t = model(t_framestack)
        mu_t = mu_t.squeeze()  # remove the batch dimension
        sigma_t = sigma_t.squeeze()  # remove the batch dimension
                
        action_dist = distributions.normal.Normal(mu_t, sigma_t)
        if greedy:
            action = mu_t
            log_prob_t = None
            entropy = None
            value_t = None
        else:
            action = action_dist.sample()
            log_prob_t = action_dist.log_prob(action)
            entropy=action_dist.entropy()
            value_t=value_t.squeeze(0)
        
        # Clamp the action values to the valid range
        action_t = torch.clamp(action, torch.tensor([-1.0, 0.0, 0.0]).to(self.device), torch.tensor([1.0, 1.0, 1.0]).to(self.device))
        return log_prob_t, entropy, action_t, value_t
    
    #----------------------------------------------------------------------------------------------------------------------
         
    def new_episode(self):
        self.env.reset()
        self.framestack.clear()
        self.model.reset_gru()
        self.episode_reward = 0
        # Skip the camera zoom-in at the start of the episode
        action_none = np.array([0, 0, 0])
        for _ in range(50):
            state, _, _, _, _ = self.env.step(action_none)
            self.framestack.append(state)

                          
        
    #----------------------------------------------------------------------------------------------------------------------
        
    def get_n_step_data(self, trajectory_len, gamma):
        
        step=0
        log_probs_t = torch.zeros (0,3).to(self.device)
        entropies_t = torch.zeros (0,3).to(self.device)
        values_t = torch.zeros (0).to(self.device)
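        self.rewards = []
        returns = []  # stays empty if the worker is not alive
        terminal = None
        return_ep_reward = None
        if self.is_alive():  # call the method; the bare attribute is always truthy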
            while True: 
                self.steps_done +=1            
                step +=1
                log_prob_t, entropy, action_t, value_t = self.predict(self.model, self.framestack)
            
                state, reward, terminated, truncated, _ = self.env.step(action_t.cpu().numpy())
                terminal = terminated or truncated
                log_probs_t = torch.cat((log_probs_t, log_prob_t.unsqueeze(0)))
                entropies_t = torch.cat((entropies_t, entropy.unsqueeze(0)))
                values_t = torch.cat((values_t, value_t))  
                self.rewards.append(reward)  
                self.episode_reward += reward
                self.last_steps_rewards.append(reward)                
                last_val = value_t.detach().item()         
                self.framestack.append(state)
                

                if terminal or (step % trajectory_len == 0):  # the trajectory or the episode has ended
                    # Compute the discounted returns
                    returns = []                 
                    if terminal:
                        R = 0
                    else:                    
                        R = last_val                                                   
                    for r in self.rewards[::-1]:
                        R = r + gamma*R 
                        returns.append(R)
                    returns.reverse()                
                    if terminal:
                        # At the end of an episode, check additional conditions
                        if self.episodes_left_to_do > 0:  # not all episodes are done yet
                            self.episode_rewards.append(self.episode_reward) 
                            return_ep_reward = self.episode_reward
                            self.new_episode() 
                            self.episodes_left_to_do -= 1                                                                              
                        else:                        
                            # All episodes are done: disable this function so that
                            # repeated calls return only None.
                            # The manager will not call it again anyway, since it checks the flag first.
                            self.__is_alive = False 
                            return_ep_reward = None  
                                             
                    break                                                
        return return_ep_reward, returns, log_probs_t, values_t, entropies_t, step
         
    #----------------------------------------------------------------------------------------------------------------------
    
    def log_stats(self, writer):
        if len(self.episode_rewards)>0:
            writer.add_scalar('worker/'+str(self.id)+'/Ep.reward', self.episode_rewards[-1], self.steps_done)
        writer.add_scalar('worker/'+str(self.id)+'/last 100 reward mean', np.mean(self.last_steps_rewards), self.steps_done)
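
The return computation inside `get_n_step_data` can be isolated into a few lines. The sketch below is a standalone illustration in plain Python; `n_step_returns` and its argument names are hypothetical, and `bootstrap_value` stands for whatever critic estimate the caller supplies when the trajectory is cut off before the episode ends.

```python
def n_step_returns(rewards, bootstrap_value, gamma, terminal):
    """Discounted returns for one trajectory, computed back to front."""
    # A finished episode has no future reward; otherwise bootstrap from the critic.
    R = 0.0 if terminal else bootstrap_value
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()  # restore chronological order
    return returns

print(n_step_returns([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5, terminal=False))
# [3.0, 4.0, 6.0]
print(n_step_returns([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5, terminal=True))
# [1.75, 1.5, 1.0]
```

The two calls differ only in the starting value of `R`, which is why the terminal trajectory ignores `bootstrap_value` entirely.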
        

Manager class

In [134]:

class Manager:
    def __init__(self,                 
                 gamma = 0.99,
                 tau = 0.5, # coefficient for the Polyak update
                 lr = 0.00001,
                 ent_coef = 0.0001, # scale of the entropy loss
                 clip_grad_norm = 40, # gradient norm limit
                 value_coef = 0.5, # scale of the value loss
                 steps_per_update = 100, # trajectory length; a fraction with no integer multiples (e.g. 0.3) never matches the step check
                 checkpoint_reward_threshold = 300, # save the best model when the reward exceeds this value
                 num_workers = 6, # number of worker networks (the original `1-6` evaluates to -5)
                 num_episodes = 10000, # this number is split across all workers
                 worker_start_offset = 100,
                 start_model_path = None, # load a model and start from it
                 fstack_size = 8               
                 ):
        
        
        self.gamma = gamma
        self.tau = tau
        self.ent_coef = ent_coef
        self.clip_grad_norm = clip_grad_norm
        self.value_coef = value_coef
        self.steps_per_update = steps_per_update
        self.checkpoint_reward_threshold = checkpoint_reward_threshold
        self.num_workers = num_workers
        self.num_episodes = num_episodes
        self.worker_start_offset = worker_start_offset
        self.start_model_path = start_model_path
        self.max_eval_reward = -10000
        self.fstack_size = fstack_size
        
        self.loss_p_log = deque(maxlen=100)
        self.loss_v_log = deque(maxlen=100)
        self.loss_e_log = deque(maxlen=100)
        self.loss_log = deque(maxlen=100)
        
        self.best_eval_reward = -1000
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.target_net = CarRacing_CNN_Model(self.fstack_size).to(self.device)
        #self.target_net.share_memory()
        self.optimizer = optim.Adam(self.target_net.parameters(), lr=lr, weight_decay=0.000005) #, weight_decay=0.00001
        self.scheduler = torch.optim.lr_scheduler.StepLR(
                    self.optimizer, step_size=100, gamma=0.9)
        self.tf_writer = SummaryWriter(flush_secs = 10, comment="_CarRacingA2C")
        
        
        os.environ['PYVIRTUALDISPLAY_DISPLAYFD'] = '0'
        # create the directories for the videos and the model checkpoints
        os.makedirs("./video", exist_ok=True)
        os.makedirs("./models", exist_ok=True)
        display = Display(visible=False, size=(800, 600))
        _ = display.start()

        # Create the workers
        self.workers = [Worker(id, self.target_net, num_episodes=self.num_episodes//self.num_workers, fstack_size = self.fstack_size, device=self.device) for id in range(self.num_workers)]
        
    #---------------------------------------------------------------------------------------------------------------------------------    
    
    def train_model(self):
        episodes_done = 0
        steps_done = 0               
        live_workers = self.num_workers  # was hard-coded to 8
        skip_this_worker = 0  # overwritten on the first pass, when all workers are alive
        traininng_complete = False        
        best_ep_reward = -1000  
        if self.start_model_path is not None:
            self.target_net.load_state_dict(torch.load(self.start_model_path, map_location=self.device))
            print (f'Model loaded from {self.start_model_path}')        
        while live_workers > 0:            
            # On each pass of the loop, skip one worker if all of them are alive
            if live_workers == self.num_workers:
                skip_this_worker = np.random.randint(0, live_workers) 
                live_workers += self.workers[skip_this_worker].kill()
            else:    
                live_workers += self.workers[skip_this_worker].revive()
            for worker in self.workers:                
                if worker.is_alive():  
                      
                    ep_reward, returns, log_probs, values, entropies, steps = worker.get_n_step_data(self.steps_per_update, self.gamma)
                    if ep_reward is not None:
                        terminal = True
                        self.tf_writer.add_scalar('Reward/Episode reward', ep_reward, steps_done)
                    else:
                        terminal = False
                    steps_done += steps
                    
                    self.update_target_weights(worker, returns, log_probs, values, entropies)
                    self.tf_writer.add_scalar('Loss/Loss', np.mean(self.loss_log), steps_done)
                    self.tf_writer.add_scalar('Loss/Policy', np.mean(self.loss_p_log), steps_done)
                    self.tf_writer.add_scalar('Loss/Value', np.mean(self.loss_v_log), steps_done)
                    self.tf_writer.add_scalar('Loss/Entropy', np.mean(self.loss_e_log), steps_done)                    
                    self.tf_writer.flush()
                    
                    if terminal:
                        episodes_done += 1                        
                        print(f'Worker {worker.id} finished episode {episodes_done}. Reward {ep_reward:.0f}. LR={self.scheduler.get_last_lr()}')
                        
                        # At the end of an episode, check whether the model should be saved
                        if ep_reward > best_ep_reward:
                            for w in self.workers:  # separate name, so the outer `worker` is not shadowed
                                w.best_ep_reward = ep_reward
                        if ep_reward > best_ep_reward+50 and ep_reward > self.checkpoint_reward_threshold:
                            print(f'New best episode reward {ep_reward:.0f}. Evaluating worker network, saving video and checkpoint')
                            best_ep_reward = ep_reward
                            target_reward = self.greedy_run(worker.model, episodes_done, render = True)                                                                                                               
                            self.tf_writer.add_scalar('Target/Trigger', ep_reward, steps_done)
                            self.tf_writer.add_scalar('Target/Evaluation Reward', target_reward, steps_done)
                            torch.save(worker.model.state_dict(), f'./models/model_ep_{episodes_done}_rew_{ep_reward:.0f}.pth')
                            print (f'Model saved to ./models/model_ep_{episodes_done}_rew_{ep_reward:.0f}.pth')                             
                        if ep_reward > 950:  # check whether training is complete
                            worker_rewards=[]
                            print ('Ep reward > 950, evaluating worker network for 10 eps')
                            for _ in range(10):
                                worker_rewards.append(self.greedy_run(worker.model, episodes_done, render=False))
                            print (f'Target network evaluation done. Mean 10 eps reward {np.mean(worker_rewards):.0f}')
                            if np.mean(worker_rewards) > 800:
                                print (f'Since mean 10 eps reward >800 ({np.mean(worker_rewards):.0f}), will do a full 100 eps evaluation')
                                worker_rewards=[]
                                for _ in range(100):
                                    worker_rewards.append(self.greedy_run(worker.model, episodes_done, render=False))
                                if np.mean(worker_rewards) > 900:
                                    print (f'Mean 100 eps reward >900 ({np.mean(worker_rewards):.0f}), training finished')  
                                    traininng_complete = True
                                    print (f'Saving final model ./models/Final_model_ep_{episodes_done}_rew_{ep_reward:.0f}.pth') 
                                    torch.save(worker.model.state_dict(), f'./models/Final_model_ep_{episodes_done}_rew_{ep_reward:.0f}.pth')
                                self.greedy_run(worker.model, episodes_done, render=True)
            if traininng_complete:
                break
        print (f'Training done.')
        
        
    #--------------------------------------------------------------------------------------------------------------------------------- 
      
    def update_target_weights(self, worker, returns, log_probs_t, values_t, entropies_t):
        
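        # compute the losses
        returns_t = torch.tensor(returns, dtype=torch.float32).to(self.device)
        advantage_t = returns_t - values_t

        value_loss = F.mse_loss(returns_t, values_t)*self.value_coef
        # detach the advantage so the policy loss does not push gradients into the critic
        policy_loss = -1*(advantage_t.detach().unsqueeze(-1) * log_probs_t).mean()
        # negative sign: entropy is a bonus, so minimizing the loss increases it
        entropy_loss = -self.ent_coef * entropies_t.mean()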
        
        loss = policy_loss + value_loss + entropy_loss  

        self.loss_v_log.append(value_loss.detach().cpu())
        self.loss_p_log.append(policy_loss.detach().cpu())
        self.loss_e_log.append(entropy_loss.detach().cpu())
        self.loss_log.append(loss.detach().cpu())

        loss.backward()
        # torch.nn.utils.clip_grad_norm_(worker.model.parameters(), self.clip_grad_norm)
        # the worker's gradients are now computed; copy them into the target network and update its weights (optimizer step)
        for worker_param, target_param in zip(worker.model.named_parameters(),self.target_net.named_parameters()):
            target_param[1].grad = worker_param[1].grad    
        
        self.optimizer.step()

        
        # zero the gradients everywhere we can reach
        self.optimizer.zero_grad()
        self.target_net.zero_grad()
        self.target_net.reset_gru()  # the next sequence comes from a different worker
        # load the target weights into the worker's model
        
        
     
        worker.model.load_state_dict(self.target_net.state_dict())
        
        worker.model.zero_grad()  
        

        return
    
    def render_mp4(self, videopath: str) -> str:
        with open(videopath, 'rb') as f:  # close the file once it has been read
            mp4 = f.read()
        base64_encoded_mp4 = b64encode(mp4).decode()
        return f'<video width=400 controls><source src="data:video/mp4;' \
                f'base64,{base64_encoded_mp4}" type="video/mp4"></video>'
                
    def greedy_run(self, model, episodes_done, render = False):
        eval_model = CarRacing_CNN_Model(self.fstack_size).to(self.device)          
        eval_model.load_state_dict(model.state_dict())
        env = gym.make("CarRacing-v2", render_mode="rgb_array", lap_complete_percent=0.95, domain_randomize=False) 
        if render:
            src_path = f"video/EVAL_{env.spec.id}_ep_{episodes_done}.mp4" 
        episode_reward = 0
        steps=0
        env.reset()                 
        tiles_open = 0    
        off_track=0         
        framestack = deque(maxlen=self.fstack_size)                              
        if render:
            vid = VideoRecorder(env, path=src_path) 
                     
        # Skip the camera zoom-in at the start of the episode and fill the frame buffer before the main loop:
        for _ in range(50):
            if render:
                env.render()
                vid.capture_frame()
            action_none = np.array([0,0,0])
            state, _, _, _, _ = env.step(action_none)
            framestack.append(state) 
            steps +=1    
                 
        # Main episode loop: act greedily until the episode ends
        while True:
            if render:
                env.render()
                vid.capture_frame()
            _, _, action_t, _ = self.workers[0].predict(eval_model, framestack, greedy=True)            
            state, reward, terminated, truncated, _ = env.step(action_t.cpu().detach().numpy())            
            episode_reward += reward
            framestack.append(state)
            steps +=1   
            if reward > 0:
                tiles_open += 1
            if reward == -100:
                off_track += 1
            if terminated or truncated:
                break
        if render:
            vid.close()
            os.rename(src_path, f"video/EVAL_{env.spec.id}_ep_{episodes_done}_rew_{episode_reward:3.1f}.mp4")
        print (f'Target run, ep. {episodes_done}. Steps: {steps}  Reward: {episode_reward:3.1f}. Tiles opened: {np.mean(tiles_open):3.1f}. Gone out: {off_track:3.1f}')
        return episode_reward
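
The loss assembled in `update_target_weights` can be written out as a formula. This is a sketch in the usual A2C convention, assuming the entropy term acts as an exploration bonus; the symbols $T$ (trajectory length), $c_v$ (`value_coef`) and $c_e$ (`ent_coef`) are notation introduced here for illustration, and the factor 3 is the number of action components.

$$
A_t = R_t - V(s_t), \qquad
\mathcal{L} =
-\frac{1}{3T}\sum_{t=1}^{T}\sum_{i=1}^{3} A_t \,\log \pi(a_{t,i}\mid s_t)
\;+\; c_v \,\frac{1}{T}\sum_{t=1}^{T}\bigl(R_t - V(s_t)\bigr)^2
\;-\; c_e \,\frac{1}{3T}\sum_{t=1}^{T}\sum_{i=1}^{3} H_{t,i}
$$

Here $R_t$ is the discounted n-step return, $H_{t,i}$ is the entropy of the Gaussian for action component $i$ at step $t$, and $A_t$ is treated as a constant in the first term, so only the second term trains the critic.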

Running the whole thing

In [135]:
manager = Manager (
    gamma = 0.99, 
    lr = 0.00001,
    ent_coef = 0.0001,
    clip_grad_norm = 40,
    value_coef = 0.5,
    steps_per_update = 16, 
    checkpoint_reward_threshold = 100,
    num_workers = 8,
    num_episodes = 100000, # split across all workers
    start_model_path = None, # 'models/model_ep_3247_rew_882).pth'
    tau = 0.99,
    fstack_size = 4
)

manager.train_model()

Worker 0 is ready
Worker 1 is ready
Worker 2 is ready
Worker 3 is ready
Worker 4 is ready
Worker 5 is ready
Worker 6 is ready
Worker 7 is ready
Worker 0 finished episode 1. Reward -76. LR=[1e-05]
Worker 1 finished episode 2. Reward -72. LR=[1e-05]
Worker 2 finished episode 3. Reward -69. LR=[1e-05]
Worker 3 finished episode 4. Reward -72. LR=[1e-05]
Worker 7 finished episode 5. Reward -77. LR=[1e-05]
Worker 5 finished episode 6. Reward -73. LR=[1e-05]
Worker 6 finished episode 7. Reward -67. LR=[1e-05]
Worker 4 finished episode 8. Reward -71. LR=[1e-05]
Worker 3 finished episode 9. Reward -76. LR=[1e-05]
Worker 0 finished episode 10. Reward -82. LR=[1e-05]
Worker 1 finished episode 11. Reward -77. LR=[1e-05]
Worker 5 finished episode 12. Reward -78. LR=[1e-05]
Worker 6 finished episode 13. Reward -78. LR=[1e-05]
Worker 2 finished episode 14. Reward -85. LR=[1e-05]
Worker 7 finished episode 15. Reward -81. LR=[1e-05]
Worker 4 finished episode 16. Reward -82. LR=[1e-05]
Worker 0 finished

                                                                

Moviepy - Done !
Moviepy - video ready video/EVAL_CarRacing-v2_ep_412.mp4
Target run, ep. 412. Steps: 1000  Reward: -39.6. Tiles opened: 15.0. Gone out: 0.0
Model saved to ./models/model_ep_412_rew_111.pth
Worker 6 finished episode 413. Reward -33. LR=[1e-05]
Worker 7 finished episode 414. Reward -17. LR=[1e-05]
Worker 5 finished episode 415. Reward -26. LR=[1e-05]
Worker 4 finished episode 416. Reward -59. LR=[1e-05]
Worker 2 finished episode 417. Reward -72. LR=[1e-05]
Worker 0 finished episode 418. Reward 26. LR=[1e-05]
Worker 3 finished episode 419. Reward -63. LR=[1e-05]
Worker 1 finished episode 420. Reward -21. LR=[1e-05]
Worker 6 finished episode 421. Reward -34. LR=[1e-05]
Worker 7 finished episode 422. Reward -25. LR=[1e-05]
Worker 5 finished episode 423. Reward -30. LR=[1e-05]
Worker 4 finished episode 424. Reward -44. LR=[1e-05]
Worker 0 finished episode 425. Reward -16. LR=[1e-05]
Worker 2 finished episode 426. Reward 48. LR=[1e-05]
Worker 3 finished episode 427. Reward -3

                                                                

Moviepy - Done !
Moviepy - video ready video/EVAL_CarRacing-v2_ep_462.mp4
Target run, ep. 462. Steps: 1000  Reward: 36.5. Tiles opened: 38.0. Gone out: 0.0
Model saved to ./models/model_ep_462_rew_163.pth
Worker 4 finished episode 463. Reward 115. LR=[1e-05]
Worker 5 finished episode 464. Reward -39. LR=[1e-05]
Worker 2 finished episode 465. Reward -35. LR=[1e-05]
Worker 1 finished episode 466. Reward 42. LR=[1e-05]
Worker 3 finished episode 467. Reward -22. LR=[1e-05]
Worker 0 finished episode 468. Reward 151. LR=[1e-05]
Worker 6 finished episode 469. Reward -36. LR=[1e-05]
Worker 7 finished episode 470. Reward 63. LR=[1e-05]
Worker 4 finished episode 471. Reward -35. LR=[1e-05]
Worker 5 finished episode 472. Reward -53. LR=[1e-05]
Worker 2 finished episode 473. Reward 67. LR=[1e-05]
Worker 1 finished episode 474. Reward -34. LR=[1e-05]
Worker 3 finished episode 475. Reward 25. LR=[1e-05]
Worker 0 finished episode 476. Reward -29. LR=[1e-05]
Worker 6 finished episode 477. Reward -42. 

                                                                

Moviepy - Done !
Moviepy - video ready video/EVAL_CarRacing-v2_ep_505.mp4
Target run, ep. 505. Steps: 1000  Reward: 31.1. Tiles opened: 29.0. Gone out: 0.0
Model saved to ./models/model_ep_505_rew_258.pth
Worker 3 finished episode 506. Reward -42. LR=[1e-05]
Worker 1 finished episode 507. Reward 121. LR=[1e-05]
Worker 0 finished episode 508. Reward 8. LR=[1e-05]
Worker 6 finished episode 509. Reward 23. LR=[1e-05]
Worker 7 finished episode 510. Reward 254. LR=[1e-05]
Worker 5 finished episode 511. Reward 15. LR=[1e-05]
Worker 4 finished episode 512. Reward 98. LR=[1e-05]
Worker 2 finished episode 513. Reward -39. LR=[1e-05]
Worker 1 finished episode 514. Reward -2. LR=[1e-05]
Worker 3 finished episode 515. Reward -28. LR=[1e-05]
Worker 0 finished episode 516. Reward -36. LR=[1e-05]
Worker 6 finished episode 517. Reward 94. LR=[1e-05]
Worker 7 finished episode 518. Reward 60. LR=[1e-05]
Worker 5 finished episode 519. Reward 53. LR=[1e-05]
Worker 4 finished episode 520. Reward 6. LR=[1e-

                                                                

Moviepy - Done !
Moviepy - video ready video/EVAL_CarRacing-v2_ep_525.mp4
Target run, ep. 525. Steps: 1000  Reward: 71.1. Tiles opened: 53.0. Gone out: 0.0
Model saved to ./models/model_ep_525_rew_336.pth
Worker 7 finished episode 526. Reward -68. LR=[1e-05]
Worker 5 finished episode 527. Reward 115. LR=[1e-05]
Worker 4 finished episode 528. Reward 207. LR=[1e-05]
Worker 2 finished episode 529. Reward 42. LR=[1e-05]
Worker 1 finished episode 530. Reward 163. LR=[1e-05]
Worker 3 finished episode 531. Reward 39. LR=[1e-05]
Worker 0 finished episode 532. Reward 49. LR=[1e-05]
Worker 6 finished episode 533. Reward -22. LR=[1e-05]
Worker 5 finished episode 534. Reward 72. LR=[1e-05]
Worker 7 finished episode 535. Reward 137. LR=[1e-05]
Worker 4 finished episode 536. Reward 277. LR=[1e-05]
Worker 2 finished episode 537. Reward 58. LR=[1e-05]
Worker 1 finished episode 538. Reward 190. LR=[1e-05]
Worker 3 finished episode 539. Reward -36. LR=[1e-05]
Worker 0 finished episode 540. Reward 1. LR=

                                                                

Moviepy - Done !
Moviepy - video ready video/EVAL_CarRacing-v2_ep_578.mp4
Target run, ep. 578. Steps: 1000  Reward: -15.5. Tiles opened: 21.0. Gone out: 0.0
Model saved to ./models/model_ep_578_rew_519.pth
Worker 3 finished episode 579. Reward 92. LR=[1e-05]
Worker 0 finished episode 580. Reward 88. LR=[1e-05]
Worker 6 finished episode 581. Reward 36. LR=[1e-05]
Worker 4 finished episode 582. Reward 242. LR=[1e-05]
Worker 7 finished episode 583. Reward 18. LR=[1e-05]
Worker 5 finished episode 584. Reward 173. LR=[1e-05]
Worker 2 finished episode 585. Reward 56. LR=[1e-05]
Worker 3 finished episode 586. Reward 63. LR=[1e-05]
Worker 1 finished episode 587. Reward 92. LR=[1e-05]
Worker 6 finished episode 588. Reward 143. LR=[1e-05]
Worker 0 finished episode 589. Reward 25. LR=[1e-05]
Worker 4 finished episode 590. Reward 150. LR=[1e-05]
Worker 7 finished episode 591. Reward 36. LR=[1e-05]
Worker 5 finished episode 592. Reward 242. LR=[1e-05]
Worker 2 finished episode 593. Reward 173. LR=[

                                                                

Moviepy - Done !
Moviepy - video ready video/EVAL_CarRacing-v2_ep_803.mp4
Target run, ep. 803. Steps: 1000  Reward: -10.9. Tiles opened: 25.0. Gone out: 0.0
Model saved to ./models/model_ep_803_rew_719.pth
Worker 0 finished episode 804. Reward 177. LR=[1e-05]
Worker 2 finished episode 805. Reward 242. LR=[1e-05]
Worker 6 finished episode 806. Reward 337. LR=[1e-05]
Worker 7 finished episode 807. Reward 184. LR=[1e-05]
Worker 5 finished episode 808. Reward 291. LR=[1e-05]
Worker 1 finished episode 809. Reward 195. LR=[1e-05]
Worker 3 finished episode 810. Reward 185. LR=[1e-05]
Worker 4 finished episode 811. Reward 507. LR=[1e-05]
Worker 0 finished episode 812. Reward 175. LR=[1e-05]
Worker 6 finished episode 813. Reward 269. LR=[1e-05]
Worker 2 finished episode 814. Reward 438. LR=[1e-05]
Worker 7 finished episode 815. Reward 481. LR=[1e-05]
Worker 5 finished episode 816. Reward 195. LR=[1e-05]
Worker 1 finished episode 817. Reward 144. LR=[1e-05]
Worker 3 finished episode 818. Reward 

                                                                

Moviepy - Done !
Moviepy - video ready video/EVAL_CarRacing-v2_ep_1477.mp4
Target run, ep. 1477. Steps: 1000  Reward: 7.1. Tiles opened: 29.0. Gone out: 0.0
Model saved to ./models/model_ep_1477_rew_786.pth
Worker 4 finished episode 1478. Reward 340. LR=[1e-05]
Worker 0 finished episode 1479. Reward 387. LR=[1e-05]
Worker 5 finished episode 1480. Reward 314. LR=[1e-05]
Worker 7 finished episode 1481. Reward 694. LR=[1e-05]
Worker 1 finished episode 1482. Reward 239. LR=[1e-05]
Worker 3 finished episode 1483. Reward 523. LR=[1e-05]
Worker 2 finished episode 1484. Reward 360. LR=[1e-05]
Worker 6 finished episode 1485. Reward 475. LR=[1e-05]
Worker 4 finished episode 1486. Reward 296. LR=[1e-05]
Worker 0 finished episode 1487. Reward 659. LR=[1e-05]
Worker 5 finished episode 1488. Reward 44. LR=[1e-05]
Worker 7 finished episode 1489. Reward 520. LR=[1e-05]
Worker 2 finished episode 1490. Reward 335. LR=[1e-05]
Worker 3 finished episode 1491. Reward 425. LR=[1e-05]
Worker 1 finished episod

                                                                

Moviepy - Done !
Moviepy - video ready video/EVAL_CarRacing-v2_ep_1782.mp4
Target run, ep. 1782. Steps: 1000  Reward: 13.8. Tiles opened: 31.0. Gone out: 0.0
Model saved to ./models/model_ep_1782_rew_858.pth
Worker 1 finished episode 1783. Reward 695. LR=[1e-05]
Worker 4 finished episode 1784. Reward 704. LR=[1e-05]
Worker 5 finished episode 1785. Reward 344. LR=[1e-05]
Worker 7 finished episode 1786. Reward 621. LR=[1e-05]
Worker 3 finished episode 1787. Reward 336. LR=[1e-05]
Worker 2 finished episode 1788. Reward 613. LR=[1e-05]
Worker 6 finished episode 1789. Reward 284. LR=[1e-05]
Worker 0 finished episode 1790. Reward 342. LR=[1e-05]
Worker 4 finished episode 1791. Reward 460. LR=[1e-05]
Worker 1 finished episode 1792. Reward 253. LR=[1e-05]
Worker 5 finished episode 1793. Reward 501. LR=[1e-05]
Worker 2 finished episode 1794. Reward 758. LR=[1e-05]
Worker 7 finished episode 1795. Reward 474. LR=[1e-05]
Worker 3 finished episode 1796. Reward 249. LR=[1e-05]
Worker 6 finished epis

                                                                

Moviepy - Done !
Moviepy - video ready video/EVAL_CarRacing-v2_ep_3459.mp4
Target run, ep. 3459. Steps: 1000  Reward: 409.7. Tiles opened: 157.0. Gone out: 0.0
Model saved to ./models/model_ep_3459_rew_924.pth
Worker 0 finished episode 3460. Reward 877. LR=[1e-05]
Worker 1 finished episode 3461. Reward 485. LR=[1e-05]
Worker 7 finished episode 3462. Reward 634. LR=[1e-05]
Worker 4 finished episode 3463. Reward 869. LR=[1e-05]
Worker 5 finished episode 3464. Reward 528. LR=[1e-05]
Worker 6 finished episode 3465. Reward 449. LR=[1e-05]
Worker 2 finished episode 3466. Reward 353. LR=[1e-05]
Worker 3 finished episode 3467. Reward 573. LR=[1e-05]
Worker 0 finished episode 3468. Reward 449. LR=[1e-05]
Worker 1 finished episode 3469. Reward 847. LR=[1e-05]
Worker 7 finished episode 3470. Reward 868. LR=[1e-05]
Worker 4 finished episode 3471. Reward 548. LR=[1e-05]
Worker 5 finished episode 3472. Reward 346. LR=[1e-05]
Worker 6 finished episode 3473. Reward 856. LR=[1e-05]
Worker 2 finished ep

KeyboardInterrupt: 