# TD3 algo for "bipedal-walker-v2" 
# https://gym.openai.com/envs/BipedalWalker-v2/

This is an implementation of the TD3 algorithm. \
How I arrived at it: \
1) I saw the assignment and solved it with an algorithm I already knew, Augmented Random Search. Result: 300+ in ~1000 iterations. That is good, but the assignment asks for an actor-critic algorithm, which I had overlooked. \
2) DDPG reached >100 by step 500, spiked sharply to 202 at step 642, then dropped to about +20 and had not recovered by step 1500. \
3) I went looking for alternative algorithms. I found a list of actor-critic models, read about DDPG in more detail, and found a description of its weaknesses. One of them may be what caused the quick collapse to a score of about 20 after the sharp peak at 202 (the score was still there roughly 700 iterations later): the Q-function starts to overestimate some Q-values, which distorts the policy. TD3 was proposed as a fix for these weaknesses. \
More details are in my notes, excerpted from the articles below:
https://drive.google.com/drive/folders/1NIasAAWASQdg8xAk4xz4qo4M1-PCHhOa?usp=sharing
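
The overestimation problem mentioned in step 3 can be illustrated with a toy experiment. This is only a sketch under stated assumptions, not part of the original notebook: the true Q-value is 0 for every action, the critic error is assumed to be independent Gaussian noise, and a small discrete set of candidate actions stands in for the continuous action space. The names `n_actions`, `n_trials`, and `noise_std` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 10        # hypothetical number of candidate actions
n_trials = 100_000    # independent repetitions of the experiment
noise_std = 1.0       # std of the critic's estimation error (assumed Gaussian)

# The true Q-value is 0 for every action, so the true maximum is also 0.
# Each critic observes the true value plus independent zero-mean noise.
q1 = rng.normal(0.0, noise_std, size=(n_trials, n_actions))
q2 = rng.normal(0.0, noise_std, size=(n_trials, n_actions))

# Single critic (DDPG-like): choose the best-looking action and trust its value.
single_estimate = q1.max(axis=1).mean()

# Two critics (TD3-like): choose the action with critic 1,
# but use the smaller of the two critics' values for that action.
rows = np.arange(n_trials)
best = q1.argmax(axis=1)
clipped_estimate = np.minimum(q1[rows, best], q2[rows, best]).mean()

print("single critic:", round(float(single_estimate), 3))
print("min of two   :", round(float(clipped_estimate), 3))
```

The single-critic estimate comes out clearly above the true value of 0, because maximizing over noisy estimates selects the positive errors. Taking the minimum of two independent critics removes most of that upward bias, at the cost of a slight underestimate.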

Articles: \
[ARS] -- https://arxiv.org/pdf/1803.07055.pdf \
[DDPG] -- https://spinningup.openai.com/en/latest/algorithms/ddpg.html#background \
[TD3] -- https://spinningup.openai.com/en/latest/algorithms/td3.html


In [0]:
!pip install gym
!apt-get update
!apt-get -qq -y install xvfb freeglut3-dev ffmpeg> /dev/null
!apt-get install xvfb
!pip install pyvirtualdisplay
!pip -q install pyglet
!pip -q install pyopengl


In [0]:
!apt-get install swig
!pip install box2d box2d-kengz


In [0]:
# Start virtual display
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1024, 768))
display.start()
import os
os.environ["DISPLAY"] = ":" + str(display.display) + "." + str(display.screen)

In [0]:
!pip install pybullet



In [0]:
import os
import numpy as np
import gym
from gym import wrappers

In [0]:
from os import path
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())

accelerator = 'cu80' if path.exists('/opt/bin/nvidia-smi') else 'cpu'

!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.4.0-{platform}-linux_x86_64.whl torchvision
import torch
print(torch.__version__)
print(torch.cuda.is_available())

0.4.0
True


In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [0]:
# This implements the set D (the replay buffer) described in the paper
# and in my notes.

class ReplayBuffer:
    def __init__(self):
        self.buffer = []
    
    def add(self, transition):
        #(state, action, reward, next_state, done)
        self.buffer.append(transition)
    
    # At each TD3 update step, random transitions are sampled from the set.
    # Why random?
    # Because the optimal Q-function satisfies the Bellman equation for all possible transitions.
    # More details are in my notes, page 2.
    def sample(self, batch_size):
        indexes = np.random.randint(0, len(self.buffer), size=batch_size)
        state, action, reward, next_state, done = [], [], [], [], []
        
        for i in indexes:
            s, a, r, s_, d = self.buffer[i]
            state.append(np.array(s, copy=False))
            action.append(np.array(a, copy=False))
            reward.append(np.array(r, copy=False))
            next_state.append(np.array(s_, copy=False))
            done.append(np.array(d, copy=False))
        
        return np.array(state), np.array(action), np.array(reward), np.array(next_state), np.array(done)
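
The buffer above grows without limit over a long run. A minimal sketch of a capped variant follows, assuming that dropping the oldest transitions is acceptable. The class name `BoundedReplayBuffer` and the `max_size` default are hypothetical and do not appear in the original notebook.

```python
import random
from collections import deque

import numpy as np


class BoundedReplayBuffer:
    """Hypothetical variant of ReplayBuffer that discards the oldest transitions."""

    def __init__(self, max_size=100_000):  # max_size is an assumed value
        # A deque with maxlen drops items from the left once it is full.
        self.buffer = deque(maxlen=max_size)

    def add(self, transition):
        # transition = (state, action, reward, next_state, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling with replacement, as with np.random.randint above.
        batch = random.choices(self.buffer, k=batch_size)
        state, action, reward, next_state, done = zip(*batch)
        return (np.array(state), np.array(action), np.array(reward),
                np.array(next_state), np.array(done))


# Usage with toy 2-dimensional states and 1-dimensional actions.
buf = BoundedReplayBuffer(max_size=3)
for i in range(5):
    s = np.full(2, i, dtype=np.float32)
    buf.add((s, np.zeros(1, dtype=np.float32), float(i), s + 1, 0.0))

states, actions, rewards, next_states, dones = buf.sample(4)
print(len(buf.buffer), states.shape, actions.shape, rewards.shape)
```

Only the three most recent transitions survive in the usage example, so every sampled reward is 2.0, 3.0, or 4.0.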


In [0]:


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class MyActor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(MyActor, self).__init__()
        self.main = nn.Sequential(
            torch.nn.Linear(state_dim, 400),
            torch.nn.ReLU(),
            torch.nn.Linear(400, 300),
            torch.nn.ReLU(),
            torch.nn.Linear(300,action_dim)
        )
        self.max_action = max_action
    def forward(self, state):
        output = torch.tanh(self.main(state)) * self.max_action
        return output
      
      
class MyCritic(nn.Module):    
    def __init__(self,state_dim, action_dim):
        super(MyCritic, self).__init__()
        self.main = nn.Sequential(
            torch.nn.Linear(state_dim + action_dim, 400),
            torch.nn.ReLU(),
            torch.nn.Linear(400, 300),
            torch.nn.ReLU(),
            torch.nn.Linear(300,1)
        )
    def forward(self, state, action):
        output = self.main(torch.cat([state, action], 1))
        return output
    
# Why and how this works: see my notes
class TD3:
    def __init__(self, state_dim, action_dim, max_action):
        
        self.actor = MyActor(state_dim, action_dim, max_action).float().to(device)
        self.actor_target = MyActor(state_dim, action_dim, max_action).float().to(device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = optim.Adam(self.actor.parameters())
        
        self.critic_1 = MyCritic(state_dim, action_dim).float().to(device)
        self.critic_1_target = MyCritic(state_dim, action_dim).float().to(device)
        self.critic_1_target.load_state_dict(self.critic_1.state_dict())
        self.critic_1_optimizer = optim.Adam(self.critic_1.parameters())
        
        self.critic_2 = MyCritic(state_dim, action_dim).float().to(device)
        self.critic_2_target = MyCritic(state_dim, action_dim).float().to(device)
        self.critic_2_target.load_state_dict(self.critic_2.state_dict())
        self.critic_2_optimizer = optim.Adam(self.critic_2.parameters())
        
        self.max_action = max_action
    
    def select_action(self, state):
        state = torch.FloatTensor(state.reshape(1, -1)).float().to(device)
        return self.actor(state).cpu().data.numpy().flatten()
    
    def update(self, replay_buffer, n_iter, batch_size, gamma, p, policy_noise, noise_clip, policy_delay):
        
        for i in range(n_iter):
            # Sample a batch of transitions from replay buffer:
            state, action_, reward, next_state, done = replay_buffer.sample(batch_size)
            state = torch.FloatTensor(state).float().to(device)
            action = torch.FloatTensor(action_).float().to(device)
            reward = torch.FloatTensor(reward).reshape((batch_size,1)).float().to(device)
            next_state = torch.FloatTensor(next_state).float().to(device)
            done = torch.FloatTensor(done).reshape((batch_size,1)).float().to(device)
            
            # Select next action according to target policy:
            noise = torch.FloatTensor(action_).data.normal_(0, policy_noise).float().to(device)
            noise = noise.clamp(-noise_clip, noise_clip)
            next_action = (self.actor_target(next_state) + noise)
            next_action = next_action.clamp(-self.max_action, self.max_action)
            
            # Compute target Q-value:
            target_Q1 = self.critic_1_target(next_state, next_action)
            target_Q2 = self.critic_2_target(next_state, next_action)
            target_Q = torch.min(target_Q1, target_Q2)
            target_Q = reward + ((1-done) * gamma * target_Q).detach()
            
            # Optimize Critic 1:
            current_Q1 = self.critic_1(state, action)
            loss_Q1 = F.mse_loss(current_Q1, target_Q)
            self.critic_1_optimizer.zero_grad()
            loss_Q1.backward()
            self.critic_1_optimizer.step()
            
            # Optimize Critic 2:
            current_Q2 = self.critic_2(state, action)
            loss_Q2 = F.mse_loss(current_Q2, target_Q)
            self.critic_2_optimizer.zero_grad()
            loss_Q2.backward()
            self.critic_2_optimizer.step()
            
            # Delayed policy updates:
            if i % policy_delay == 0:
                # Compute actor loss:
                actor_loss = -self.critic_1(state, self.actor(state)).mean()
                
                # Optimize the actor
                self.actor_optimizer.zero_grad()
                actor_loss.backward()
                self.actor_optimizer.step()
                
                # Polyak averaging of the target networks (p is "polyak" in the article)
                for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                    target_param.data.copy_( (p * target_param.data) + ((1-p) * param.data))
                
                for param, target_param in zip(self.critic_1.parameters(), self.critic_1_target.parameters()):
                    target_param.data.copy_( (p * target_param.data) + ((1-p) * param.data))
                
                for param, target_param in zip(self.critic_2.parameters(), self.critic_2_target.parameters()):
                    target_param.data.copy_( (p * target_param.data) + ((1-p) * param.data))
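
The three ingredients of `update` (target policy smoothing, the clipped double-Q target, and Polyak averaging) can be checked by hand on a tiny batch. The sketch below uses plain NumPy with made-up numbers standing in for network outputs. In the real code the target critics are evaluated at `next_action`; here their outputs are fixed constants, so only the arithmetic is shown. `max_action = 1.0` is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same values as the hyperparameters in train(); max_action is assumed to be 1.0.
gamma, policy_noise, noise_clip, p = 0.99, 0.2, 0.5, 0.995
max_action = 1.0

# Hypothetical batch of 3 transitions with a 4-dimensional action.
target_actor_out = np.array([[0.9, -0.95, 0.0, 0.3]] * 3)  # stands in for actor_target(next_state)
reward = np.array([[1.0], [0.5], [-100.0]])
done = np.array([[0.0], [0.0], [1.0]])
target_q1 = np.array([[10.0], [4.0], [7.0]])  # stands in for critic_1_target(...)
target_q2 = np.array([[8.0], [5.0], [9.0]])   # stands in for critic_2_target(...)

# 1) Target policy smoothing: clipped Gaussian noise, then clip to the action range.
noise = rng.normal(0.0, policy_noise, size=target_actor_out.shape)
noise = np.clip(noise, -noise_clip, noise_clip)
next_action = np.clip(target_actor_out + noise, -max_action, max_action)

# 2) Clipped double-Q target: the smaller critic value, zeroed on terminal steps.
target_q = reward + (1.0 - done) * gamma * np.minimum(target_q1, target_q2)

# 3) Polyak averaging of one toy parameter vector.
param = np.array([1.0, 2.0])
target_param = np.zeros(2)
target_param = p * target_param + (1.0 - p) * param

print(target_q.ravel())
print(target_param)
```

With `p = 0.995` the target parameters move only 0.5% of the way toward the online parameters per update, which keeps the regression target for the critics slowly varying.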

In [0]:


def train():
    ######### Hyperparameters #########
    env_name = "BipedalWalker-v2"
    log_interval = 10           # print avg reward after interval
    random_seed = 0
    gamma = 0.99                # discount for future rewards
    batch_size = 100            # num of transitions sampled from replay buffer
    exploration_noise = 0.1 
    p = 0.995              # target policy update parameter (1-tau)
    policy_noise = 0.2          # target policy smoothing noise
    noise_clip = 0.5
    policy_delay = 2            # delayed policy updates parameter
    max_episodes = 10000         # max num of episodes
    max_timesteps = 2000        # max timesteps in one episode
    directory = "./preTrained/{}".format(env_name) # save trained models
    filename = "TD3_{}_{}".format(env_name, random_seed)
    ###################################
    
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    max_action = float(env.action_space.high[0])
    
    policy = TD3(state_dim, action_dim, max_action)
    replay_buffer = ReplayBuffer()
    
    if random_seed:
        print("Random Seed: {}".format(random_seed))
        env.seed(random_seed)
        torch.manual_seed(random_seed)
        np.random.seed(random_seed)
    
    # logging variables:
    avg_reward = 0
    ep_reward = 0
    log_f = open("log.txt","w+")
    
    # training procedure:
    for episode in range(1, max_episodes+1):
        state = env.reset()
        for t in range(max_timesteps):
            # select action and add exploration noise:
            action = policy.select_action(state)
            action = action + np.random.normal(0, exploration_noise, size=env.action_space.shape[0])
            action = action.clip(env.action_space.low, env.action_space.high)
            
            # take action in env:
            next_state, reward, done, _ = env.step(action)
            replay_buffer.add((state, action, reward, next_state, float(done)))
            state = next_state
            
            avg_reward += reward
            ep_reward += reward
            
            # if episode is done then update policy:
            if done or t==(max_timesteps-1):
                policy.update(replay_buffer, t, batch_size, gamma, p, policy_noise, noise_clip, policy_delay)
                break
#         break
                
        if (ep_reward) >= 300:
            print("====FINALLY! GOT IT!=====")
            print("At episode: {}".format(episode))
            log_f.close()
            break
            
        ep_reward = 0
        
        if episode % log_interval == 0:
            avg_reward = int(avg_reward / log_interval)
            print("Episode: {}\tAverage Reward: {}".format(episode, avg_reward))
            avg_reward = 0

train()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Episode: 10	Average Reward: -110
Episode: 20	Average Reward: -107
Episode: 30	Average Reward: -106
Episode: 40	Average Reward: -127
Episode: 50	Average Reward: -112
Episode: 60	Average Reward: -103
Episode: 70	Average Reward: -110
Episode: 80	Average Reward: -118
Episode: 90	Average Reward: -113
Episode: 100	Average Reward: -110
Episode: 110	Average Reward: -92
Episode: 120	Average Reward: -106
Episode: 130	Average Reward: -124
Episode: 140	Average Reward: -139
Episode: 150	Average Reward: -136
Episode: 160	Average Reward: -94
Episode: 170	Average Reward: -83
Episode: 180	Average Reward: -93
Episode: 190	Average Reward: -90
Episode: 200	Average Reward: -83
Episode: 210	Average Reward: -84
Episode: 220	Average Reward: -53
Episode: 230	Average Reward: 19
Episode: 240	Average 