# Dueling - DQN


### References:

See [Human-level control through deep reinforcement learning](https://www.nature.com/articles/nature14236) for the original DQN publication and its pseudocode.

In [1]:
'''
Installing packages for rendering the game on Colab
'''

!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get update > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!apt-get install -y cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install git+https://github.com/tensorflow/docs > /dev/null 2>&1
!pip install gym[classic_control]



In [2]:
'''
A bunch of imports, you don't have to worry about these
'''

import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import namedtuple, deque
import torch.optim as optim
import datetime
import gym
from gym.wrappers.record_video import RecordVideo
import glob
import io
import base64
import matplotlib.pyplot as plt
from IPython.display import HTML
from pyvirtualdisplay import Display
import tensorflow as tf
from IPython import display as ipythondisplay
from PIL import Image
#import tensorflow_probability as tfp
import warnings
warnings.filterwarnings("ignore")

In [3]:
'''
Please refer to the first tutorial for more details on the specifics of environments
We've only added important commands you might find useful for experiments.
'''

'''
List of example environments
(Source - https://gym.openai.com/envs/#classic_control)

'Acrobot-v1'
'CartPole-v1'
'MountainCar-v0'
'''

env = gym.make('CartPole-v1')
env.seed(0)

state_shape = env.observation_space.shape[0]
no_of_actions = env.action_space.n

print("State shape:", state_shape)
print("Number of Actions: ",no_of_actions)
print("Sampled Action",env.action_space.sample())
print("----")

'''
# Understanding State, Action, Reward Dynamics

The agent decides which action to take depending on the state.

The environment keeps a variable specifically for the current state.
- Every time an action is passed to the environment, it calculates the new state and updates the current state variable.
- It returns the new current state and the reward, which the agent uses to choose its next action.

'''

state = env.reset()
''' This returns the initial state (when environment is reset) '''

print("Current_State: ",state)
print("----")

action = env.action_space.sample()
''' We take a random action now '''

print("Sampled Action2: ", action)
print("----")

next_state, reward, done, info = env.step(action)
''' env.step is used to calculate new state and obtain reward based on old state and action taken  '''

print("Next_State: ",next_state)
print("Reward: ",reward)
print("Done: ", done)
print("Info: ", info)
print("----")


State shape: 4
Number of Actions:  2
Sampled Action 1
----
Current_State:  [ 0.01369617 -0.02302133 -0.04590265 -0.04834723]
----
Sampled Action2:  1
----
Next_State:  [ 0.01323574  0.17272775 -0.04686959 -0.3551522 ]
Reward:  1.0
Done:  False
Info:  {}
----


## DQN

Using NNs as function approximators in place of a Q-table isn't new. It had been tried earlier, but the 'Human-level control' paper popularised the approach by introducing a few stabilising ideas (Q-Targets, Experience Replay & Truncation, the last implemented in this notebook as gradient clipping). The 'Deep Q-Network' (DQN) algorithm can be broken down into the following components.
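
To make the role of the Q-Targets idea concrete, here is a minimal NumPy sketch of the one-step target `r + gamma * max_a' Q_target(s', a')`, masked so that terminal transitions do not bootstrap. The helper name `td_targets` and the batch values are assumptions made for illustration; they are not part of the notebook's agent.

```python
import numpy as np

def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """One-step Q-learning targets: r + gamma * max_a' Q_target(s', a') * (1 - done)."""
    # Greedy value of the next state under the (frozen) target network.
    max_next_q = next_q_target.max(axis=1, keepdims=True)
    # The (1 - done) mask drops the bootstrap term for terminal transitions.
    return rewards + gamma * max_next_q * (1.0 - dones)

# Illustrative batch of two transitions (values invented for the example).
rewards = np.array([[1.0], [1.0]])
next_q_target = np.array([[2.0, 3.0], [5.0, 4.0]])
dones = np.array([[0.0], [1.0]])  # the second transition ends the episode

targets = td_targets(rewards, next_q_target, dones)
print(targets)
```

The first target bootstraps from the larger next-state value, while the second reduces to the immediate reward because the episode has ended.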

### Q-Network:
The neural network used as a function approximator is defined below
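
As a small standalone illustration of the dueling aggregation that `QNetwork1` applies in its forward pass (`algo_type` 1 subtracts the mean advantage, `algo_type` 2 subtracts the max advantage), the sketch below combines a value and an advantage array with NumPy. The function name `dueling_q` and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def dueling_q(value, advantage, mode="mean"):
    """Combine V(s) of shape (batch, 1) and A(s, a) of shape (batch, actions) into Q(s, a)."""
    if mode == "mean":
        baseline = advantage.mean(axis=1, keepdims=True)
    elif mode == "max":
        baseline = advantage.max(axis=1, keepdims=True)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Subtracting a baseline makes the V/A split identifiable.
    return value + advantage - baseline

value = np.array([[10.0]])
advantage = np.array([[1.0, 3.0]])

q_mean = dueling_q(value, advantage, "mean")
q_max = dueling_q(value, advantage, "max")
print(q_mean, q_max)
```

With the mean baseline the Q-values average to V(s); with the max baseline the greedy Q-value equals V(s). The ranking of actions is the same in both cases.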

In [4]:

import random
from collections import deque, namedtuple

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.special import softmax

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

'''
Bunch of Hyper parameters (Which you might have to tune later)
'''
'''BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
LR = 5e-4               # learning rate
UPDATE_EVERY = 20       # how often to update the network (When Q target is present)'''

class QNetwork1(nn.Module):

    def __init__(self, state_size, action_size, seed, algo_type=1, num_common_layers =1,num_common_layer_units= 64,num_val_layers=1,num_val_layer_units=128,num_adv_layers=1,num_adv_layer_units=128):
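        """Initialize parameters and build the dueling Q-network.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Number of discrete actions
            seed (int): Random seed
            algo_type (int): 1 subtracts the mean advantage, 2 subtracts the max advantage
            num_common_layers (int): Number of shared hidden layers
            num_common_layer_units (int): Units per shared hidden layer
            num_val_layers / num_adv_layers (int): Hidden layers in the value / advantage stream
            num_val_layer_units / num_adv_layer_units (int): Units per layer in each stream
        """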
        super(QNetwork1, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.algo_type = algo_type
        activation = nn.LeakyReLU
        self.fcs = nn.Sequential(*[nn.Linear(state_size, num_common_layer_units),activation()])
        self.fc_common = nn.Sequential(*[nn.Sequential(*[nn.Linear(num_common_layer_units,num_common_layer_units),activation()]) for _ in range(num_common_layers-1)])
        self.fc_adv_start = nn.Sequential(*[nn.Linear(num_common_layer_units,num_adv_layer_units),activation()])
        self.fc_val_start = nn.Sequential(*[nn.Linear(num_common_layer_units,num_val_layer_units),activation()])
        self.fc_adv_hidden = nn.Sequential(*[nn.Sequential(*[nn.Linear(num_adv_layer_units,num_adv_layer_units),activation()]) for _ in range(num_adv_layers-1)])
        self.fc_val_hidden = nn.Sequential(*[nn.Sequential(*[nn.Linear(num_val_layer_units,num_val_layer_units),activation()]) for _ in range(num_val_layers-1)])
        self.fc_advantage = nn.Linear(num_adv_layer_units, action_size)
        self.fc_value = nn.Linear(num_val_layer_units, 1)

    def forward(self, state):
        """Build a network that maps state -> action values."""
        xs = self.fcs(state)
        x_common = self.fc_common(xs)
        x_adv_start = self.fc_adv_start(x_common)
        x_val_start = self.fc_val_start(x_common)
        x_adv_hidden = self.fc_adv_hidden(x_adv_start)
        x_val_hidden = self.fc_val_hidden(x_val_start)
        Val = self.fc_value(x_val_hidden)
        Adv = self.fc_advantage(x_adv_hidden)

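        # Combine the two streams into Q-values (dueling aggregation).
        if self.algo_type == 1:
            # Type 1: subtract the mean advantage
            adv_average = torch.mean(Adv, dim=1, keepdim=True)
            return Val + Adv - adv_average
        elif self.algo_type == 2:
            # Type 2: subtract the max advantage
            adv_max, _ = torch.max(Adv, dim=1, keepdim=True)
            return Val + Adv - adv_max
        raise ValueError(f"Unknown algo_type: {self.algo_type}")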

class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.

        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)

class TutorialAgent():

    def __init__(self, state_size, action_size, seed, algo_type=1,exploration='epsilon', BUFFER_SIZE = int(1e2), BATCH_SIZE = 32, GAMMA = 0.99, LR = 1e-3, UPDATE_EVERY = 20, num_common_layers =1,num_common_layer_units= 256,num_val_layers=1,num_val_layer_units=256,num_adv_layers=1,num_adv_layer_units=256 ):

        ''' Agent Environment Interaction '''
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(seed)
        self.exploration = exploration
        self.BUFFER_SIZE = BUFFER_SIZE
        self.BATCH_SIZE = BATCH_SIZE
        self.GAMMA = GAMMA
        self.LR = LR
        self.UPDATE_EVERY = UPDATE_EVERY

        ''' Q-Network '''
        self.qnetwork_local = QNetwork1(state_size, action_size, seed, algo_type=algo_type, num_common_layers = num_common_layers,num_common_layer_units= num_common_layer_units,num_val_layers=num_val_layers,num_val_layer_units=num_val_layer_units,num_adv_layers=num_adv_layers,num_adv_layer_units=num_adv_layer_units).to(device)
        self.qnetwork_target = QNetwork1(state_size, action_size, seed, algo_type=algo_type, num_common_layers = num_common_layers,num_common_layer_units= num_common_layer_units,num_val_layers=num_val_layers,num_val_layer_units=num_val_layer_units,num_adv_layers=num_adv_layers,num_adv_layer_units=num_adv_layer_units).to(device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=LR)

        ''' Replay memory '''
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, seed)

        ''' Initialize time step (for updating every UPDATE_EVERY steps)           -Needed for Q Targets '''
        self.t_step = 0

    def step(self, state, action, reward, next_state, done):

        ''' Save experience in replay memory '''
        self.memory.add(state, action, reward, next_state, done)

        ''' If enough samples are available in memory, get random subset and learn '''
        if len(self.memory) >= self.BATCH_SIZE:
            experiences = self.memory.sample()
            self.learn(experiences, self.GAMMA)

        """ +Q TARGETS PRESENT """
        ''' Updating the Network every 'UPDATE_EVERY' steps taken '''
        self.t_step = (self.t_step + 1) % self.UPDATE_EVERY
        if self.t_step == 0:

            self.qnetwork_target.load_state_dict(self.qnetwork_local.state_dict())

    def act(self, state, eps=0.):

        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.qnetwork_local.eval()
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()
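        if(self.exploration == 'softmax'):
          ''' Softmax action selection; the temperature tau is tied to eps '''
          tau = max(10*eps, 1e-8)  # guard against division by zero when eps == 0 (the floor value is an arbitrary choice)
          Prob = softmax(action_values.cpu().data.numpy()/tau)
          return random.choices(np.arange(self.action_size),weights=Prob[0])[0]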
        else:
          ''' Epsilon-greedy action selection (Already Present) '''
          if random.random() > eps:
              return np.argmax(action_values.cpu().data.numpy())
          else:
              return random.choice(np.arange(self.action_size))


    def learn(self, experiences, gamma):
        """ +E EXPERIENCE REPLAY PRESENT """
        states, actions, rewards, next_states, dones = experiences

        ''' Get max predicted Q values (for next states) from target model'''
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)

        ''' Compute Q targets for current states '''
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        ''' Get expected Q values from local model '''
        Q_expected = self.qnetwork_local(states).gather(1, actions)

        ''' Compute loss '''
        loss = F.mse_loss(Q_expected, Q_targets)

        ''' Minimize the loss '''
        self.optimizer.zero_grad()
        loss.backward()

        ''' Gradient Clipping '''
        """ +T TRUNCATION PRESENT """
        for param in self.qnetwork_local.parameters():
            param.grad.data.clamp_(-1, 1)

        self.optimizer.step()
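
The softmax branch of `act` turns Q-values into sampling probabilities using a temperature. The sketch below, which assumes a hypothetical helper `softmax_probs` written in plain NumPy rather than the `scipy.special.softmax` call used in the agent, shows how the temperature moves the policy between near-uniform and near-greedy behaviour.

```python
import numpy as np

def softmax_probs(q_values, tau):
    """Boltzmann action probabilities at temperature tau (tau must be > 0)."""
    # Subtracting the max before exponentiating avoids overflow for small tau.
    z = (q_values - np.max(q_values)) / tau
    e = np.exp(z)
    return e / e.sum()

q = np.array([1.0, 2.0])            # illustrative Q-values for two actions
hot = softmax_probs(q, tau=10.0)    # high temperature: close to uniform
cold = softmax_probs(q, tau=0.01)   # low temperature: nearly greedy

rng = np.random.default_rng(0)
action = rng.choice(len(q), p=cold)  # sample an action index from the distribution
print(hot, cold, action)
```

Because the agent sets `tau = 10*eps`, the decay schedule for `eps` also controls how quickly softmax exploration becomes greedy. The temperature has to stay strictly positive, which is worth keeping in mind since `act` defaults to `eps=0.`.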

### Here, we present the DQN algorithm code.

In [5]:
''' Defining DQN Algorithm '''

def dqn(agent, n_episodes=500, max_t=1000, eps_start=0.5, eps_end=0.001, eps_decay=0.995):
    Rewards = []
    Regret = 0
    scores_window = deque(maxlen=100)
    ''' last 100 scores, used for the running average (CartPole-v1 counts as solved at an average of 475 over 100 episodes) '''

    eps = eps_start
    ''' initialize epsilon '''

    for i_episode in range(1, n_episodes+1):
        state = env.reset()
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            next_state, reward, done, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        Rewards.append(score)
        Regret+=(500-score)
        scores_window.append(score)

        eps = max(eps_end, eps_decay*eps)
        ''' decrease epsilon '''

        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")

        if i_episode % 100 == 0:
           print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
    #print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))

    return Rewards
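
The update `eps = max(eps_end, eps_decay*eps)` in the loop above decays exploration multiplicatively once per episode, so the choice of `eps_decay` decides whether `eps` ever reaches its floor within the training budget. The following sketch, using a hypothetical helper `episodes_until_floor`, counts how many episodes that takes for two of the decay values that appear in this notebook.

```python
def episodes_until_floor(eps_start, eps_end, eps_decay):
    """Count episodes of multiplicative decay until eps first reaches eps_end."""
    eps, episodes = eps_start, 0
    while eps > eps_end:
        eps = max(eps_end, eps_decay * eps)  # same update rule as in dqn()
        episodes += 1
    return episodes

slow = episodes_until_floor(eps_start=0.5, eps_end=0.001, eps_decay=0.995)
fast = episodes_until_floor(eps_start=0.5, eps_end=0.001, eps_decay=0.85)
print(slow, fast)
```

With a decay of 0.85 the floor is reached after roughly 40 episodes, whereas with 0.995 it would take more than a thousand, so a 500-episode run never becomes fully greedy.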


# Final Draft

In [6]:
#!pip install wandb

In [7]:
import wandb
wandb.login()  # reads the API key from the WANDB_API_KEY environment variable or prompts for it; avoid hard-coding keys in a notebook

wandb: Currently logged in as: nayinisriharsh-iitm. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [8]:
def train_and_tune(config=None):
  # Initialize a new wandb run
  with wandb.init(config=config):
    # If called by wandb.agent, as below,
    # this config will be set by Sweep Controller
    config = wandb.config
    wandb.run.name='al_'+str(config.act_algorithm)+'-bfs_'+str(config.buffer_size)+'-bts_'+str(config.batch_size)+'-epitaust'+str(config.epsilon_tau_start)+'-lr'+str(config.lr)+'-upd'+str(config.update_every)+'-clr'+str(config.num_common_layers)+'-uts'+str(config.num_common_layer_units)+'-eptst'+str(config.epsilon_tau_start)+'-eptd'+str(config.epsilon_tau_decay)
    state_shape = env.observation_space.shape[0]
    action_shape = env.action_space.n
    num_exp = 3
    max_episodes = 500
    total_rewards = np.zeros([num_exp,max_episodes])
    for i in range(num_exp):
      agent = TutorialAgent(state_size=state_shape,action_size = action_shape,seed = i, algo_type=config.type,exploration=config.act_algorithm, BUFFER_SIZE = config.buffer_size, BATCH_SIZE = config.batch_size, GAMMA = 0.99, LR = config.lr, UPDATE_EVERY = config.update_every, num_common_layers =config.num_common_layers,num_common_layer_units= config.num_common_layer_units,num_val_layers=1,num_val_layer_units=config.num_common_layer_units,num_adv_layers=1,num_adv_layer_units=config.num_common_layer_units)
      curr_rewards = dqn(agent,n_episodes=max_episodes,eps_start=config.epsilon_tau_start,eps_end=0.001,eps_decay=config.epsilon_tau_decay)
      total_rewards[i] = curr_rewards
    Regret = np.mean(np.sum(500-total_rewards,axis=1),axis=0)
    #wandb.log({"train_mean_reward":rewards,"train_mean_steps":steps,"test_mean_reward":r,"train_mean_steps":s})
    #data = [[x, y] for (x, y) in zip(np.arange(config.episodes), rewards)]
    #table1 = wandb.Table(data=data, columns=["x", "y"])
    #data = [[x, y] for (x, y) in zip(np.arange(config.episodes), steps)]
    #table2 = wandb.Table(data=data, columns=["x", "y"])
    wandb.log(
        {
            "avg_regret":Regret#,'avg_test_steps':s #,"train_reward": wandb.plot.line(table1, "x", "y", title="Reward vs Episode"),"train_steps": wandb.plot.line(table2, "x", "y", title="Steps vs Episode"),
        }
    )
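
The `avg_regret` metric logged above sums, for each experiment, the shortfall of every episode's score from 500 and then averages over experiments. A minimal sketch of that computation on made-up scores follows; the helper name `average_regret` and the sample data are assumptions for illustration.

```python
import numpy as np

def average_regret(total_rewards, max_return=500):
    """Mean over runs of the summed per-episode shortfall from max_return."""
    # total_rewards has shape (num_runs, num_episodes).
    per_run_regret = np.sum(max_return - total_rewards, axis=1)
    return float(np.mean(per_run_regret))

# Two hypothetical runs of three episodes each (scores invented for the example).
total_rewards = np.array([[500.0, 400.0, 300.0],
                          [100.0, 200.0, 500.0]])

regret = average_regret(total_rewards)
print(regret)
```

Under this definition a 500-episode run can accumulate at most 250000 regret, and lower is better, which is why the sweep is configured to minimize the metric.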

In [9]:
sweep_config={'method':'bayes',
              'metric' : {
                  'name':'avg_regret',
                  'goal':'minimize'},
              'parameters':{
                  'type':{'values':[1]} ,
                  'act_algorithm':{'values':['softmax','epsilon']},
                  'buffer_size':{'values':[int(1e2),int(5e2),int(1e3),int(1e4)]},
                  'batch_size':{'values':[32,64,128]},
                  'lr':{'values':[1e-2, 1e-3, 1e-4]},
                  'update_every':{'values':[10,20,30]},
                  'num_common_layers':{'values':[1,2,3]},
                  'num_common_layer_units':{'values':[64,128,256]},
                  'epsilon_tau_start':{'values':[1,0.5]},
                  'epsilon_tau_decay':{'values':[0.995,0.95,0.9,0.85]},
                  }}
import pprint
pprint.pprint(sweep_config)
sweep_id=wandb.sweep(sweep_config,project="CS6700_PROGRAMMING_ASSIGNMENT_2")

{'method': 'bayes',
 'metric': {'goal': 'minimize', 'name': 'avg_regret'},
 'parameters': {'act_algorithm': {'values': ['softmax', 'epsilon']},
                'batch_size': {'values': [32, 64, 128]},
                'buffer_size': {'values': [100, 500, 1000, 10000]},
                'epsilon_tau_decay': {'values': [0.995, 0.95, 0.9, 0.85]},
                'epsilon_tau_start': {'values': [1, 0.5]},
                'lr': {'values': [0.01, 0.001, 0.0001]},
                'num_common_layer_units': {'values': [64, 128, 256]},
                'num_common_layers': {'values': [1, 2, 3]},
                'type': {'values': [1]},
                'update_every': {'values': [10, 20, 30]}}}
Create sweep with ID: 3iha1id2
Sweep URL: https://wandb.ai/nayinisriharsh-iitm/CS6700_PROGRAMMING_ASSIGNMENT_2/sweeps/3iha1id2


Create sweep with ID: qxxi8syy
Sweep URL: https://wandb.ai/nayinisriharsh-iitm/CS6700_PROGRAMMING_ASSIGNMENT_2/sweeps/3iha1id2
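
To put the sweep budget in perspective, the sketch below counts how many distinct configurations the search space in `sweep_config` contains and compares that with the 50 runs requested from `wandb.agent`. The per-parameter counts are copied by hand from the configuration printed above; the dictionary name `choices` is just a placeholder.

```python
from math import prod

# Number of candidate values per hyperparameter, copied from sweep_config.
choices = {
    'type': 1,
    'act_algorithm': 2,
    'buffer_size': 4,
    'batch_size': 3,
    'lr': 3,
    'update_every': 3,
    'num_common_layers': 3,
    'num_common_layer_units': 3,
    'epsilon_tau_start': 2,
    'epsilon_tau_decay': 4,
}

grid_size = prod(choices.values())   # size of the full Cartesian product
coverage = 50 / grid_size            # fraction visited by a 50-run sweep
print(grid_size, f"{coverage:.2%}")
```

Since 50 runs cover only a small fraction of the grid, a guided search method such as the `bayes` setting used here is a reasonable alternative to an exhaustive grid search.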

In [None]:
wandb.agent(sweep_id, train_and_tune,count=50)

wandb: Agent Starting Run: pzmym4i4 with config:
wandb: 	act_algorithm: softmax
wandb: 	batch_size: 64
wandb: 	buffer_size: 100
wandb: 	epsilon_tau_decay: 0.95
wandb: 	epsilon_tau_start: 1
wandb: 	lr: 0.01
wandb: 	num_common_layer_units: 256
wandb: 	num_common_layers: 2
wandb: 	type: 1
wandb: 	update_every: 30


Episode 100	Average Score: 85.87
Episode 200	Average Score: 74.36
Episode 300	Average Score: 20.33
Episode 400	Average Score: 29.36
Episode 500	Average Score: 88.76
Episode 100	Average Score: 59.80
Episode 200	Average Score: 73.56
Episode 300	Average Score: 91.15
Episode 400	Average Score: 93.38
Episode 500	Average Score: 13.52
Episode 100	Average Score: 57.18
Episode 200	Average Score: 29.84
Episode 300	Average Score: 19.44
Episode 400	Average Score: 9.52
Episode 500	Average Score: 10.22


avg_regret: 224790.33333


wandb: Agent Starting Run: 4nr7jdzl with config:
wandb: 	act_algorithm: epsilon
wandb: 	batch_size: 128
wandb: 	buffer_size: 100
wandb: 	epsilon_tau_decay: 0.995
wandb: 	epsilon_tau_start: 1
wandb: 	lr: 0.01
wandb: 	num_common_layer_units: 128
wandb: 	num_common_layers: 1
wandb: 	type: 1
wandb: 	update_every: 10


Episode 100	Average Score: 20.47
Episode 200	Average Score: 13.03
Episode 300	Average Score: 11.42
Episode 400	Average Score: 10.51
Episode 500	Average Score: 10.14
Episode 100	Average Score: 19.88
Episode 200	Average Score: 21.18
Episode 300	Average Score: 20.37
Episode 400	Average Score: 19.69
Episode 500	Average Score: 20.84
Episode 100	Average Score: 20.63
Episode 200	Average Score: 18.22
Episode 300	Average Score: 15.72
Episode 400	Average Score: 15.13
Episode 500	Average Score: 14.69


avg_regret: 241602.66667


wandb: Agent Starting Run: vaa0tzfp with config:
wandb: 	act_algorithm: softmax
wandb: 	batch_size: 32
wandb: 	buffer_size: 10000
wandb: 	epsilon_tau_decay: 0.95
wandb: 	epsilon_tau_start: 0.5
wandb: 	lr: 0.001
wandb: 	num_common_layer_units: 64
wandb: 	num_common_layers: 1
wandb: 	type: 1
wandb: 	update_every: 20


Episode 100	Average Score: 167.02
Episode 200	Average Score: 159.70
Episode 300	Average Score: 9.44
Episode 400	Average Score: 9.78
Episode 500	Average Score: 9.40
Episode 100	Average Score: 92.57
Episode 200	Average Score: 55.50
Episode 300	Average Score: 9.34
Episode 400	Average Score: 9.95
Episode 500	Average Score: 10.40
Episode 100	Average Score: 90.25
Episode 200	Average Score: 16.34
Episode 300	Average Score: 15.67
Episode 400	Average Score: 71.10
Episode 500	Average Score: 19.71


avg_regret: 225127.66667


wandb: Agent Starting Run: s5dom6n9 with config:
wandb: 	act_algorithm: softmax
wandb: 	batch_size: 128
wandb: 	buffer_size: 10000
wandb: 	epsilon_tau_decay: 0.995
wandb: 	epsilon_tau_start: 1
wandb: 	lr: 0.01
wandb: 	num_common_layer_units: 64
wandb: 	num_common_layers: 3
wandb: 	type: 1
wandb: 	update_every: 10


Episode 100	Average Score: 53.17
Episode 200	Average Score: 11.53
Episode 300	Average Score: 12.21
Episode 400	Average Score: 186.15
Episode 500	Average Score: 67.30
Episode 100	Average Score: 86.15
Episode 200	Average Score: 112.61
Episode 300	Average Score: 9.43
Episode 400	Average Score: 13.44
Episode 500	Average Score: 61.05
Episode 100	Average Score: 99.11
Episode 200	Average Score: 9.43
Episode 300	Average Score: 38.38
Episode 400	Average Score: 75.03
Episode 500	Average Score: 78.65


avg_regret: 219545.33333


wandb: Agent Starting Run: btm70e3k with config:
wandb: 	act_algorithm: epsilon
wandb: 	batch_size: 128
wandb: 	buffer_size: 1000
wandb: 	epsilon_tau_decay: 0.95
wandb: 	epsilon_tau_start: 1
wandb: 	lr: 0.01
wandb: 	num_common_layer_units: 256
wandb: 	num_common_layers: 1
wandb: 	type: 1
wandb: 	update_every: 10


Episode 100	Average Score: 124.51
Episode 200	Average Score: 156.01
Episode 300	Average Score: 194.68
Episode 400	Average Score: 146.67
Episode 500	Average Score: 174.45
Episode 100	Average Score: 134.49
Episode 200	Average Score: 111.16
Episode 300	Average Score: 156.55
Episode 400	Average Score: 119.87
Episode 500	Average Score: 133.53
Episode 100	Average Score: 140.92
Episode 200	Average Score: 173.71
Episode 300	Average Score: 158.77
Episode 400	Average Score: 155.70
Episode 500	Average Score: 149.12


avg_regret: 175662.0


wandb: Agent Starting Run: k8bos58e with config:
wandb: 	act_algorithm: epsilon
wandb: 	batch_size: 64
wandb: 	buffer_size: 10000
wandb: 	epsilon_tau_decay: 0.995
wandb: 	epsilon_tau_start: 0.5
wandb: 	lr: 0.001
wandb: 	num_common_layer_units: 64
wandb: 	num_common_layers: 3
wandb: 	type: 1
wandb: 	update_every: 20


Episode 100	Average Score: 78.88
Episode 200	Average Score: 46.04
Episode 300	Average Score: 18.62
Episode 400	Average Score: 9.52
Episode 500	Average Score: 20.25
Episode 100	Average Score: 85.15
Episode 200	Average Score: 51.70
Episode 300	Average Score: 45.03
Episode 400	Average Score: 9.64
Episode 500	Average Score: 11.29
Episode 100	Average Score: 87.35
Episode 200	Average Score: 80.05
Episode 300	Average Score: 13.55
Episode 400	Average Score: 9.75
Episode 500	Average Score: 76.78


avg_regret: 228546.66667


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: 00qamgwp with config:
wandb: 	act_algorithm: softmax
wandb: 	batch_size: 64
wandb: 	buffer_size: 1000
wandb: 	epsilon_tau_decay: 0.9
wandb: 	epsilon_tau_start: 1
wandb: 	lr: 0.01
wandb: 	num_common_layer_units: 64
wandb: 	num_common_layers: 3
wandb: 	type: 1
wandb: 	update_every: 20


Episode 100	Average Score: 55.60
Episode 200	Average Score: 102.21
Episode 300	Average Score: 83.15
Episode 400	Average Score: 85.89
Episode 500	Average Score: 118.19
Episode 100	Average Score: 80.81
Episode 200	Average Score: 62.72
Episode 300	Average Score: 83.44
Episode 400	Average Score: 79.06
Episode 500	Average Score: 80.82
Episode 100	Average Score: 82.84
Episode 200	Average Score: 75.55
Episode 300	Average Score: 49.33
Episode 400	Average Score: 96.94
Episode 500	Average Score: 93.41


avg_regret: 209001.33333


wandb: Agent Starting Run: 5yck26ki with config:
wandb: 	act_algorithm: softmax
wandb: 	batch_size: 64
wandb: 	buffer_size: 1000
wandb: 	epsilon_tau_decay: 0.85
wandb: 	epsilon_tau_start: 0.5
wandb: 	lr: 0.01
wandb: 	num_common_layer_units: 256
wandb: 	num_common_layers: 2
wandb: 	type: 1
wandb: 	update_every: 10


Episode 100	Average Score: 99.51
Episode 200	Average Score: 83.13
Episode 300	Average Score: 67.63
Episode 400	Average Score: 121.63
Episode 500	Average Score: 112.75
Episode 100	Average Score: 117.83
Episode 200	Average Score: 62.85
Episode 300	Average Score: 57.19
Episode 400	Average Score: 71.41
Episode 500	Average Score: 25.43
Episode 100	Average Score: 96.18
Episode 200	Average Score: 84.35
Episode 300	Average Score: 68.87
Episode 400	Average Score: 91.75
Episode 500	Average Score: 77.50


avg_regret: 208733.0


wandb: Agent Starting Run: or1c7zyn with config:
wandb: 	act_algorithm: epsilon
wandb: 	batch_size: 64
wandb: 	buffer_size: 1000
wandb: 	epsilon_tau_decay: 0.85
wandb: 	epsilon_tau_start: 0.5
wandb: 	lr: 0.001
wandb: 	num_common_layer_units: 128
wandb: 	num_common_layers: 3
wandb: 	type: 1
wandb: 	update_every: 10


Episode 100	Average Score: 81.04
Episode 200	Average Score: 96.05
Episode 300	Average Score: 110.06
Episode 400	Average Score: 52.08
Episode 500	Average Score: 72.54
Episode 100	Average Score: 81.81
Episode 200	Average Score: 73.90
Episode 300	Average Score: 105.00
Episode 400	Average Score: 166.96
Episode 500	Average Score: 213.71
Episode 100	Average Score: 237.28
Episode 200	Average Score: 114.95
Episode 300	Average Score: 227.53
Episode 400	Average Score: 135.56
Episode 500	Average Score: 145.68


avg_regret: 186195.0


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: ftyvqdea with config:
wandb: 	act_algorithm: softmax
wandb: 	batch_size: 32
wandb: 	buffer_size: 1000
wandb: 	epsilon_tau_decay: 0.95
wandb: 	epsilon_tau_start: 0.5
wandb: 	lr: 0.01
wandb: 	num_common_layer_units: 256
wandb: 	num_common_layers: 1
wandb: 	type: 1
wandb: 	update_every: 10


Episode 100	Average Score: 50.21
Episode 200	Average Score: 26.34
Episode 300	Average Score: 22.59
Episode 400	Average Score: 35.00
Episode 500	Average Score: 35.92
Episode 100	Average Score: 41.41
Episode 200	Average Score: 30.31
Episode 300	Average Score: 28.91
Episode 400	Average Score: 15.89
Episode 500	Average Score: 39.61
Episode 100	Average Score: 37.08
Episode 200	Average Score: 19.81
Episode 300	Average Score: 26.07
Episode 400	Average Score: 19.05
Episode 500	Average Score: 29.22


avg_regret: 234752.66667


wandb: Agent Starting Run: u2haujgi with config:
wandb: 	act_algorithm: softmax
wandb: 	batch_size: 32
wandb: 	buffer_size: 100
wandb: 	epsilon_tau_decay: 0.9
wandb: 	epsilon_tau_start: 0.5
wandb: 	lr: 0.0001
wandb: 	num_common_layer_units: 256
wandb: 	num_common_layers: 2
wandb: 	type: 1
wandb: 	update_every: 20


Episode 100	Average Score: 171.87
Episode 200	Average Score: 124.19
Episode 300	Average Score: 316.65
Episode 400	Average Score: 303.56
Episode 500	Average Score: 265.21
Episode 100	Average Score: 127.87
Episode 200	Average Score: 220.55
Episode 300	Average Score: 282.75
Episode 400	Average Score: 283.01
Episode 500	Average Score: 281.65
Episode 100	Average Score: 62.42
Episode 200	Average Score: 67.91
Episode 300	Average Score: 403.85
Episode 400	Average Score: 454.27
Episode 500	Average Score: 419.12


avg_regret: 123837.33333


wandb: Agent Starting Run: vqvh6msq with config:
wandb: 	act_algorithm: softmax
wandb: 	batch_size: 64
wandb: 	buffer_size: 1000
wandb: 	epsilon_tau_decay: 0.995
wandb: 	epsilon_tau_start: 0.5
wandb: 	lr: 0.01
wandb: 	num_common_layer_units: 64
wandb: 	num_common_layers: 2
wandb: 	type: 1
wandb: 	update_every: 30


Episode 100	Average Score: 44.48
Episode 200	Average Score: 157.77
Episode 300	Average Score: 58.76
Episode 400	Average Score: 58.72
Episode 500	Average Score: 30.79
Episode 100	Average Score: 74.67
Episode 200	Average Score: 142.53
Episode 300	Average Score: 120.64
Episode 400	Average Score: 58.93
Episode 500	Average Score: 100.95
Episode 100	Average Score: 55.46
Episode 200	Average Score: 71.53
Episode 300	Average Score: 57.75
Episode 400	Average Score: 61.38
Episode 500	Average Score: 52.53


avg_regret: 211770.33333


wandb: Agent Starting Run: fnjezbda with config:
wandb: 	act_algorithm: epsilon
wandb: 	batch_size: 64
wandb: 	buffer_size: 500
wandb: 	epsilon_tau_decay: 0.85
wandb: 	epsilon_tau_start: 1
wandb: 	lr: 0.0001
wandb: 	num_common_layer_units: 64
wandb: 	num_common_layers: 3
wandb: 	type: 1
wandb: 	update_every: 30


Episode 100	Average Score: 27.64
Episode 200	Average Score: 154.30
Episode 300	Average Score: 397.00
Episode 400	Average Score: 402.13
Episode 500	Average Score: 338.88
Episode 100	Average Score: 10.13
Episode 200	Average Score: 58.20
Episode 300	Average Score: 267.40
Episode 400	Average Score: 250.10
Episode 500	Average Score: 212.18
Episode 100	Average Score: 28.52
Episode 200	Average Score: 148.66
Episode 300	Average Score: 297.12
Episode 400	Average Score: 200.96
Episode 500	Average Score: 332.69


avg_regret: 145803.0


wandb: Agent Starting Run: 6b3h1bu6 with config:
wandb: 	act_algorithm: softmax
wandb: 	batch_size: 32
wandb: 	buffer_size: 10000
wandb: 	epsilon_tau_decay: 0.995
wandb: 	epsilon_tau_start: 0.5
wandb: 	lr: 0.01
wandb: 	num_common_layer_units: 128
wandb: 	num_common_layers: 1
wandb: 	type: 1
wandb: 	update_every: 30


Episode 100	Average Score: 63.90
Episode 200	Average Score: 25.33
Episode 300	Average Score: 14.75
Episode 400	Average Score: 9.39
Episode 500	Average Score: 9.27
Episode 100	Average Score: 64.04
Episode 200	Average Score: 9.36
Episode 300	Average Score: 11.78
Episode 400	Average Score: 10.42
Episode 500	Average Score: 10.57
Episode 100	Average Score: 69.00
Episode 200	Average Score: 9.59
Episode 300	Average Score: 67.21
Episode 400	Average Score: 9.31
Episode 500	Average Score: 9.35


avg_regret: 236891.0


wandb: Agent Starting Run: bc1gtgka with config:
wandb: 	act_algorithm: epsilon
wandb: 	batch_size: 32
wandb: 	buffer_size: 100
wandb: 	epsilon_tau_decay: 0.85
wandb: 	epsilon_tau_start: 0.5
wandb: 	lr: 0.0001
wandb: 	num_common_layer_units: 256
wandb: 	num_common_layers: 3
wandb: 	type: 1
wandb: 	update_every: 20


Episode 100	Average Score: 41.34
Episode 200	Average Score: 311.77
Episode 300	Average Score: 309.33
Episode 400	Average Score: 365.56
Episode 500	Average Score: 298.43
Episode 100	Average Score: 29.54
Episode 200	Average Score: 287.13
Episode 300	Average Score: 381.19
Episode 400	Average Score: 313.31
Episode 411	Average Score: 311.13