# SAC (Soft Actor-Critic)

SAC (Soft Actor-Critic) is one of today's popular algorithms. It builds on the **off-policy** DDPG algorithm discussed [here](./05-ddpg.ipynb).<br>
However, unlike DDPG, SAC applies entropy regularization and trains a stochastic policy rather than a deterministic one.

Entropy is defined as $ H(P) = -\int P(x) \log P(x) \, dx = E_x[-\log P(x)] $ and, intuitively, it measures how spread out (how random) the distribution $ P(\cdot) $ is.<br>
For instance, if a discrete distribution has 8 possible states, each of which is equally likely, it will have $ H(P) = -\sum P(x) \log P(x) = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = 3 $. (In other words, 3 bits are needed to encode each outcome.) If the distribution is $ (\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}) $, it will have $ H(P) = -\frac{1}{2} \log_2 \frac{1}{2} - \frac{1}{4} \log_2 \frac{1}{4} - \frac{1}{8} \log_2 \frac{1}{8} - \frac{1}{16} \log_2 \frac{1}{16} - 4 \times \frac{1}{64} \log_2 \frac{1}{64} = 2 $.<br>
(Note that, for simplicity, I have replaced the base e of the logarithm with 2.)<br>

As you can see above, the entropy is larger when the distribution is more random (closer to uniform).
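
The two entropy values above can be checked numerically. The following is a small sketch using only the standard library; the helper name `entropy_bits` is introduced here for illustration and is not part of the notebook.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H(P) = -sum p * log2(p), skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]

print(entropy_bits(uniform))  # 3.0
print(entropy_bits(skewed))   # 2.0
```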

> Note : Among continuous distributions with a given mean and variance, the Gaussian distribution is known to maximize the entropy. I don't go into details here, but the KL-divergence (the penalty for large updates) discussed in [PPO](./04-ppo.ipynb) is closely related to this entropy term.

In SAC, instead of the target $ r_t + \gamma (1 - d_t) Q_{{\phi}^{\prime}} $ used in DDPG, we use $ r_t + \gamma (1 - d_t) (Q_{{\phi}^{\prime}} + \alpha H(P)) $ (where $ d_t $ is the done flag and $\alpha$ is the coefficient that weights the entropy, called the entropy temperature) in order to balance exploitation and exploration.<br>
Even if the estimated Q-value increases, an update might be rejected when it reduces the entropy too much.

*(back to [index](https://github.com/tsmatz/reinforcement-learning-tutorials/))*

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import random
# import gym 
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

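Unlike [DDPG](./05-ddpg.ipynb), SAC can be used with a discrete action space. (The reason is explained below.)<br>
The original tutorial uses the standard CartPole agent in Gym. In this notebook the `gym` import is commented out, and the agent is trained on the custom `SetGenerationEnv` environment instead.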

Unlike DDPG, SAC trains a stochastic policy $ \pi_{\theta}(\cdot) $ (where $ \theta $ are the parameters) instead of a deterministic policy $ \mu_{\theta}(\cdot) $. (We also don't use a target policy network $ \pi_{\theta^{\prime}} $.)<br>
In this example, I use a categorical distribution (the same as in the [policy gradient](./02-policy-gradient.ipynb) and [PPO](./04-ppo.ipynb) examples) for the policy $ P(\cdot | \pi_\theta(s)) $, because the action space is discrete.

> Note : For a bounded continuous action space between $ l $ and $ h $, use a Gaussian distribution as follows.<br>
> $ P(\cdot | \pi_\theta(s)) = ((\tanh(\mathcal{N}(\mu_{\theta}(s), \sigma_{\theta}(s))) + 1.0) / 2.0) \times (h - l) + l  $

Because we use a stochastic policy, we no longer need the Ornstein-Uhlenbeck noise used in [DDPG](./05-ddpg.ipynb).
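
The bounded-action transform in the note above can be sketched as follows. This is a minimal standard-library illustration, not the notebook's method (the notebook uses a categorical policy). The fixed numbers passed as `mu` and `sigma` stand in for the network outputs $ \mu_\theta(s) $ and $ \sigma_\theta(s) $, and the function name is hypothetical.

```python
import math
import random

def sample_bounded_action(mu, sigma, low, high, rng=random):
    """Draw u ~ N(mu, sigma), squash it with tanh into [-1, 1], then rescale to [low, high]."""
    u = rng.gauss(mu, sigma)
    squashed = math.tanh(u)
    return (squashed + 1.0) / 2.0 * (high - low) + low

rng = random.Random(0)  # seeded generator so the run is repeatable
samples = [sample_bounded_action(0.0, 1.0, low=-2.0, high=2.0, rng=rng) for _ in range(1000)]
print(min(samples), max(samples))  # both values stay inside [-2, 2]
```

Because the randomness comes from the Gaussian draw itself, every call already explores, which is why no extra noise process is added on top.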

In [139]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Policy net (pi_theta)
class PolicyNet(nn.Module):
    def __init__(self, hidden_dim=64, state_dim = 4, nActions = 20):
        super().__init__()

        self.hidden = nn.Linear(state_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, 2*nActions)  # one logit for each of the 2 * nActions discrete actions

    def forward(self, s):
        outs = self.hidden(s)
        outs = F.relu(outs)
        outs = self.output(outs)
        return outs

# pi_model = PolicyNet(nActions=20).to(device)

# Pick up action (for each step in episode)
def pick_sample(s, pi_model):
    with torch.no_grad():
        #   --> size : (1, state_dim)
        s_batch = np.expand_dims(s, axis=0)
        s_batch = torch.tensor(s_batch, dtype=torch.float).to(device)
        # Get logits from state
        #   --> size : (1, 2*nActions)
        logits = pi_model(s_batch)
        #   --> size : (2*nActions)
        logits = logits.squeeze(dim=0)
        # From logits to probabilities
        probs = F.softmax(logits, dim=-1)
        # Sample one action index from the categorical distribution
        #   --> size : (1)
        a = torch.multinomial(probs, num_samples=1)
        #   --> size : ()
        a = a.squeeze(dim=0)
        # Return the action as a Python int
        return a.tolist()

As in the clipped double-Q (twin-Q) DDPG method (see the latter part of [this notebook](./05-ddpg.ipynb)), we use 2 Q-networks - $ Q_{\phi_1}(s), Q_{\phi_2}(s) $ - and 2 corresponding target networks - $ Q_{\phi_1^{\prime}}(s), Q_{\phi_2^{\prime}}(s) $.

You will find that this is different from the one used in [DDPG](./05-ddpg.ipynb). (In DDPG, we have used $Q(s, a)$.)<br>
For a categorical distribution over n actions (n=2 in the original CartPole example; n is `2*nActions` in the code below), the output of $ Q(\cdot) $ is an n-dimensional tensor, in which each element is the estimated Q-value of the corresponding action. We then use $ Q(s) \cdot \tilde{a} $ (i.e., a dot product) instead of $ Q(s, a) $, where $ \tilde{a} $ is the one-hot tensor for action $ a $.<br>
For this reason, we use $Q(s)$ instead of $Q(s, a)$.
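
As a quick check of the dot-product form, the following sketch (plain Python, with made-up Q-values for a hypothetical 3-action case) shows that $ Q(s) \cdot \tilde{a} $ simply selects the entry of $ Q(s) $ that belongs to action $ a $.

```python
def one_hot(index, n):
    """One-hot vector of length n with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(n)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

q_values = [0.3, -1.2, 2.5]   # hypothetical output of Q(s) for n = 3 actions
action = 2

print(dot(q_values, one_hot(action, len(q_values))))  # 2.5, the same as q_values[action]
```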

In [140]:
class QNet(nn.Module):
    def __init__(self, state_dim = 4, hidden_dim=64, nActions = 20):
        super().__init__()

        self.hidden = nn.Linear(state_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, 2*nActions)  # one Q-value for each of the 2 * nActions discrete actions

    def forward(self, s):
        outs = self.hidden(s)
        outs = F.relu(outs)
        outs = self.output(outs)
        return outs

# q_origin_model1 = QNet(nActions=20).to(device)  # Q_phi1
# q_origin_model2 = QNet(nActions=20).to(device)  # Q_phi2
# q_target_model1 = QNet(nActions=20).to(device)  # Q_phi1'
# q_target_model2 = QNet(nActions=20).to(device)  # Q_phi2'
# _ = q_target_model1.requires_grad_(False)  # target model doesn't need grad
# _ = q_target_model2.requires_grad_(False)  # target model doesn't need grad

As we saw in [clipped double-Q DDPG](./05-ddpg.ipynb), we optimize policy parameter $ \theta $ to maximize $ Q_{\phi_1}(s, a^*) + \alpha H(P(\cdot | \pi_\theta(s))) $ where $ a^* $ is an optimal action.

As I have mentioned above, $ H(P) = E_x[-\log P(x)] $.<br>
In this categorical distribution (in discrete action space), $ H $ will then be the following dot product :

$ H(P) = H(P(\cdot | \pi_\theta(s))) = -\pi_\theta(s) \cdot \log \pi_\theta(s) $

where $ \pi_\theta(s) $ is the vector of action probabilities (one entry per action, summing to 1).

The $ Q(s, a^*) $ term becomes the following dot product, i.e. the expectation of the Q-value under the policy. (See above for the reason.) :

$ Q_{\phi_1}(s, a^*) = Q_{\phi_1}(s) \cdot \pi_\theta(s) $

To summarize, we should optimize $ \theta $ to maximize :

$ E\left[ \pi_\theta(s) \cdot Q_{\phi_1}(s) - \alpha \pi_\theta(s) \cdot \log \pi_\theta(s) \right] = E\left[ \pi_\theta(s) \cdot (Q_{\phi_1}(s) - \alpha \log \pi_\theta(s)) \right] $
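
To get a feel for this objective, the sketch below evaluates $ \pi \cdot (Q - \alpha \log \pi) $ for a single state with made-up Q-values. It compares a greedy policy, a uniform policy, and the Boltzmann policy $ \pi \propto \exp(Q / \alpha) $, which is the maximizer of this expression for fixed Q-values. All names and numbers here are illustrative assumptions, not taken from the notebook.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_objective(pi, q, alpha):
    """pi . (Q - alpha * log pi) for one state; zero-probability entries contribute 0."""
    return sum(p * (qa - alpha * math.log(p)) for p, qa in zip(pi, q) if p > 0)

q = [1.0, 2.0, 0.5]   # hypothetical Q(s) for 3 actions
alpha = 0.4

policies = {
    "greedy": [0.0, 1.0, 0.0],
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "boltzmann": softmax([qa / alpha for qa in q]),
}
for name, pi in policies.items():
    print(name, soft_objective(pi, q, alpha))
```

The Boltzmann policy scores highest: it gives up a little expected Q-value compared with the greedy policy, but gains more from the entropy bonus.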

> Note : Here I have used a constant $ \alpha $, but the appropriate temperature ($ \alpha $) depends on the magnitude of the rewards, and it is not easy to determine, because it also depends on the policy, which improves over time during training.<br>
> There exists a variation of SAC in which $ \alpha $ is also learned over the course of training, so that the policy's entropy tracks a chosen target entropy.

In [141]:
alpha = 0.4
# alpha = 0.1


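class categorical:
    # Action distribution P(. | pi_theta(s)); uses the global pi_model
    def __init__(self, s):
        logits = pi_model(s)
        self._prob = F.softmax(logits, dim=-1)
        # log_softmax avoids the -inf / NaN that log(softmax(.)) gives when a probability underflows to 0
        self._logp = F.log_softmax(logits, dim=-1)

    # probability (sum is 1.0) : P
    def prob(self):
        return self._prob

    # log probability : log P()
    def logp(self):
        return self._logp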

def optimize_theta(states, alpha):
    # Convert to tensor
    states = torch.tensor(states, dtype=torch.float).to(device)
    # Disable grad in q_origin_model1 before computation
    # (or use q_value.detach() not to include in graph)
    for p in q_origin_model1.parameters():
        p.requires_grad = False
    # Optimize
    opt_pi.zero_grad()
    dist = categorical(states)
    q_value = q_origin_model1(states)
    term1 = dist.prob()
    # alpha = log_alpha.exp().detach()
    term2 = q_value - alpha * dist.logp()
    # print(term1.shape, term2.shape)
    # return
    expectation = term1.unsqueeze(dim=1) @ term2.unsqueeze(dim=2)
    expectation = expectation.squeeze(dim=1)
    (-expectation).sum().backward()
    opt_pi.step()
    # Enable grad again
    for p in q_origin_model1.parameters():
        p.requires_grad = True
        
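def optimize_alpha(states):
    states = torch.tensor(states, dtype=torch.float).to(device)
    # The policy is evaluated without gradients: only log_alpha is updated here
    with torch.no_grad():
        dist = categorical(states)
        # E_a[log pi(a|s)] = -H(P) for each state in the batch --> size : (batch_size)
        neg_entropy = (dist.prob() * dist.logp()).sum(dim=-1).to(log_alpha.device)
    alpha_optimizer.zero_grad()
    # Standard temperature loss: -alpha * (E[log pi] + target entropy), averaged over the batch.
    # For a categorical policy the entropy lies in [0, log(n)], so targetEntropy
    # has to be chosen inside that range for alpha to settle.
    alphaLoss = -(log_alpha.exp() * (neg_entropy + targetEntropy)).mean()
    alphaLoss.backward()
    alpha_optimizer.step()
    # Return the new temperature (assigning to a local `alpha` has no effect on the caller)
    return log_alpha.exp().detach()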

As in [clipped double-Q DDPG](./05-ddpg.ipynb), we optimize the parameters $ \phi_1, \phi_2 $ as follows :

- Optimize $ \phi_1 $ to minimize $ E\left[ \left( Q_{\phi_1}(s_t, a_t) - \left( r_t + \gamma (1 - d_t) \left( \min_{i=1,2} Q_{{\phi_i}^{\prime}}(s_{t+1},a^*_{t+1}) + \alpha H(P(\cdot | \pi_\theta(s_{t+1}))) \right) \right) \right)^2 \right] $
- Optimize $ \phi_2 $ to minimize $ E\left[ \left( Q_{\phi_2}(s_t, a_t) - \left( r_t + \gamma (1 - d_t) \left( \min_{i=1,2} Q_{{\phi_i}^{\prime}}(s_{t+1},a^*_{t+1}) + \alpha H(P(\cdot | \pi_\theta(s_{t+1}))) \right) \right) \right)^2 \right] $

in which :

- $ Q_{\phi_i}(s_t, a_t) = Q_{\phi_i}(s_t) \cdot \tilde{a_t} $ where $ \tilde{a_t} $ is one hot vector of $ a_t $
- $ Q_{{\phi_i}^{\prime}}(s_{t+1},a^*_{t+1}) =  Q_{\phi_i^{\prime}}(s_{t+1}) \cdot \pi_\theta(s_{t+1}) $ where $ \pi_\theta(s_{t+1}) $ is the vector of action probabilities
- $ H(P(\cdot | \pi_\theta(s_{t+1}))) = -\pi_\theta(s_{t+1}) \cdot \log \pi_\theta(s_{t+1}) $
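
The target inside the squared error can be traced by hand for a single transition. The sketch below uses plain Python and made-up numbers for a 2-action case; `soft_td_target` is a hypothetical helper that follows the same order of operations as `optimize_phi` (expected target Q under the policy for each target network, then the minimum, then the entropy bonus).

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def soft_td_target(r, done, q1_next, q2_next, pi_next, alpha, gamma=0.99):
    """r + gamma * (1 - d) * (min_i Q_i'(s') . pi(s') + alpha * H(pi(s')))."""
    q_min = min(dot(q1_next, pi_next), dot(q2_next, pi_next))
    entropy = -sum(p * math.log(p) for p in pi_next if p > 0)
    return r + gamma * (1.0 - done) * (q_min + alpha * entropy)

# Hypothetical values for one transition
q1_next = [1.0, 2.0]    # Q_phi1'(s_next)
q2_next = [1.5, 1.0]    # Q_phi2'(s_next)
pi_next = [0.5, 0.5]    # pi_theta(s_next)

print(soft_td_target(1.0, 0.0, q1_next, q2_next, pi_next, alpha=0.4))  # non-terminal step
print(soft_td_target(1.0, 1.0, q1_next, q2_next, pi_next, alpha=0.4))  # terminal step: just the reward
```

On a terminal step the factor $ (1 - d_t) $ is zero, so neither the bootstrapped Q-value nor the entropy bonus enters the target.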

In [85]:
gamma = 0.99


def optimize_phi(states, actions, rewards, next_states, dones, alpha, nActions=20):
    #
    # Convert to tensor
    #
    states = torch.tensor(states, dtype=torch.float).to(device)
    actions = torch.tensor(actions, dtype=torch.int64).to(device)
    rewards = torch.tensor(rewards, dtype=torch.float).to(device)
    rewards = rewards.unsqueeze(dim=1)
    next_states = torch.tensor(next_states, dtype=torch.float).to(device)
    dones = torch.tensor(dones, dtype=torch.float).to(device)
    dones = dones.unsqueeze(dim=1)

    #
    # Compute r + gamma * (1 - d) (min Q(s_next,a_next') + alpha * H(P))
    #
    # alpha = log_alpha.exp().detach()
    with torch.no_grad():
        # min Q(s_next,a_next')
        q1_tgt_next = q_target_model1(next_states)
        q2_tgt_next = q_target_model2(next_states)
        dist_next = categorical(next_states)
        q1_target = q1_tgt_next.unsqueeze(dim=1) @ dist_next.prob().unsqueeze(dim=2)
        q1_target = q1_target.squeeze(dim=1)
        q2_target = q2_tgt_next.unsqueeze(dim=1) @ dist_next.prob().unsqueeze(dim=2)
        q2_target = q2_target.squeeze(dim=1)
        q_target_min = torch.minimum(q1_target, q2_target)
        # alpha * H(P)
        h = dist_next.prob().unsqueeze(dim=1) @ dist_next.logp().unsqueeze(dim=2)
        h = h.squeeze(dim=1)
        h = -alpha * h
        # total
        term2 = rewards + gamma * (1.0 - dones) * (q_target_min + h)

    #
    # Optimize critic loss for Q-network1
    #
    opt_q1.zero_grad()
    one_hot_actions = F.one_hot(actions, num_classes=2*nActions).float()
    q_value1 = q_origin_model1(states)
    term1 = q_value1.unsqueeze(dim=1) @ one_hot_actions.unsqueeze(dim=2)
    term1 = term1.squeeze(dim=1)
    loss_q1 = F.mse_loss(
        term1,
        term2,
        reduction="none")
    loss_q1.sum().backward()
    opt_q1.step()

    #
    # Optimize critic loss for Q-network2
    #
    opt_q2.zero_grad()
    one_hot_actions = F.one_hot(actions, num_classes=2*nActions).float()
    q_value2 = q_origin_model2(states)
    term1 = q_value2.unsqueeze(dim=1) @ one_hot_actions.unsqueeze(dim=2)
    term1 = term1.squeeze(dim=1)
    loss_q2 = F.mse_loss(
        term1,
        term2,
        reduction="none")
    loss_q2.sum().backward()
    opt_q2.step()

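As we saw in [clipped double-Q DDPG](./05-ddpg.ipynb), the target parameters $\phi_1^{\prime}, \phi_2^{\prime}$ follow $ \phi_1, \phi_2 $ with a delay: at each update they move a small step towards the current parameters, $ \phi_i^{\prime} \leftarrow \tau \phi_i + (1 - \tau) \phi_i^{\prime} $, where $ \tau $ is a coefficient (hyper-parameter).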
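
How slowly the target follows can be seen in a toy run. The sketch below applies the update to plain Python lists (standing in for network weights, an assumption for illustration) while the online weights are held fixed. After k updates the remaining gap is the initial gap times $ (1 - \tau)^k $.

```python
def soft_update(target, online, tau):
    """One delayed update: target <- tau * online + (1 - tau) * target, element-wise."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online, target)]

online = [1.0, -2.0]   # hypothetical current weights, held fixed here
target = [0.0, 0.0]    # hypothetical target weights
tau = 0.002

for _ in range(1000):
    target = soft_update(target, online, tau)

print(target)                   # still short of `online` after 1000 updates
print(1.0 - (1.0 - tau) ** 1000)  # fraction of the initial gap that has been closed
```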

In [7]:
tau = 0.002

def update_target():
    for var, var_target in zip(q_origin_model1.parameters(), q_target_model1.parameters()):
        var_target.data = tau * var.data + (1.0 - tau) * var_target.data
    for var, var_target in zip(q_origin_model2.parameters(), q_target_model2.parameters()):
        var_target.data = tau * var.data + (1.0 - tau) * var_target.data

As we saw in [DDPG](./05-ddpg.ipynb), we use a replay buffer so that learning does not rely only on the most recent experiences.

In [8]:
class replayBuffer:
    def __init__(self, buffer_size: int):
        self.buffer_size = buffer_size
        self.buffer = []
        self._next_idx = 0

    def add(self, item):
        if len(self.buffer) > self._next_idx:
            self.buffer[self._next_idx] = item
        else:
            self.buffer.append(item)
        if self._next_idx == self.buffer_size - 1:
            self._next_idx = 0
        else:
            self._next_idx = self._next_idx + 1

    def sample(self, batch_size):
        indices = [random.randint(0, len(self.buffer) - 1) for _ in range(batch_size)]
        states   = [self.buffer[i][0] for i in indices]
        actions  = [self.buffer[i][1] for i in indices]
        rewards  = [self.buffer[i][2] for i in indices]
        n_states = [self.buffer[i][3] for i in indices]
        dones    = [self.buffer[i][4] for i in indices]
        return states, actions, rewards, n_states, dones

    def length(self):
        return len(self.buffer)

buffer = replayBuffer(20000)
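
The overwrite-oldest behaviour of the buffer above can also be seen with the standard library's `collections.deque`. This is a swapped-in illustration, not the notebook's class: a `deque` with `maxlen` drops the oldest item once it is full, and `random.choice` draws uniformly with replacement, as `sample` does. The transition tuples below are made up.

```python
import random
from collections import deque

demo_buffer = deque(maxlen=3)   # capacity of 3 transitions, for illustration only
for step in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    demo_buffer.append((f"s{step}", step % 2, 0.0, f"s{step + 1}", False))

print([t[0] for t in demo_buffer])  # ['s2', 's3', 's4']: the two oldest were dropped

batch = [random.choice(demo_buffer) for _ in range(4)]  # uniform sampling with replacement
print(len(batch))  # 4
```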

Now let's put it all together!

In [9]:
import pickle
with open('inputTestCases/_input2ways_n=4_.pickle', 'rb') as f:
    roadDefs = pickle.load(f) # deserialize using load()


In [175]:
from tqdm import tqdm  # used by train() below
from junctionart.roundabout.encodingGFN.setGenerationEnv import SetGenerationEnv
size = 4
nActions = 30

# # models
# pi_model = PolicyNet(state_dim=size, nActions=nActions).to(device)
# q_origin_model1 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi1
# q_origin_model2 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi2
# q_target_model1 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi1'
# q_target_model2 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi2'
# _ = q_target_model1.requires_grad_(False)  # target model doesn't need grad
# _ = q_target_model2.requires_grad_(False)  # target model doesn't need grad
# buffer = replayBuffer(20000)

# # optimizers
# opt_pi = torch.optim.AdamW(pi_model.parameters(), lr=0.0005)
# opt_q1 = torch.optim.AdamW(q_origin_model1.parameters(), lr=0.0005)
# opt_q2 = torch.optim.AdamW(q_origin_model2.parameters(), lr=0.0005)

def train(env, nIter = 6000, batch_size = 250, disableBar = False):
    for i in tqdm(range(nIter), disable = disableBar):
        # Run episode till done
        s = torch.zeros(1, size)
        done = False
        cum_reward = 0
        while not done:
            a = pick_sample((s/nActions).squeeze().tolist(), pi_model)

            s_next = env.update(s, torch.tensor([a]), inPlace = False)
            
            done = (s_next != 0).all().item()
            if done:
                config = (s_next.squeeze() - 1).long().tolist()
                r = 10**env.getProxyReward(config, normalize=True)
         
            else:
                r = 0
            buffer.add([(s/nActions).squeeze().tolist(), a, r, (s_next/nActions).squeeze().tolist(), float(done)])
            cum_reward += r
            if buffer.length() >= 3000:
                states, actions, rewards, n_states, dones = buffer.sample(batch_size)
                optimize_theta(states, alpha)
                # optimize_alpha(states)
                
                optimize_phi(states, actions, rewards, n_states, dones, alpha, nActions=nActions)
                update_target()
            s = s_next
        
            # alpha = log_alpha.exp().detach()
        print("Run episode{} with rewards {} s {} ALPHA {}".format(i, cum_reward, s.squeeze().tolist(), alpha), end="\r")
    

In [161]:
from tqdm import tqdm
def sampleRewardWithConfig(nIter, setEnv, pi_model):
    # Roll out the trained policy nIter times and collect (proxy reward, config) pairs
    rewardWithConfigs = []
    for i in tqdm(range(nIter)):
        # Run episode till done
        s = torch.zeros(1, size)
        done = False

        while not done:
            a = pick_sample((s/nActions).squeeze().tolist(), pi_model)

            # Use the environment passed in as setEnv (not the global env)
            s_next = setEnv.update(s, torch.tensor([a]))

            # The episode ends once every slot of the state is filled (non-zero)
            done = (s_next != 0).all().item()

            if done:
                config = (s_next.squeeze() - 1).long().tolist()
                r = setEnv.getProxyReward(config, normalize=True)

                rewardWithConfigs.append((r, config))
            s = s_next
    return rewardWithConfigs

def getTopK(rewardWithConfigs, K):
    modes = []
    proxyRewards = []
    
    rewardWithConfigs.sort(key = lambda x : x[0], reverse=True)

    for reward, config in rewardWithConfigs[:K]:  # top-K samples
        modes.append(config)
        proxyRewards.append(reward)

    
    return modes, proxyRewards

In [162]:
from junctionart.roundabout.encodingGFN.RoundaboutLaneEncodingEnv import RoundaboutLaneEncodingEnv
from junctionart.roundabout.RewardUtil import RewardUtil

def getRoundabouts(roadDefinition, modes):
    env = RoundaboutLaneEncodingEnv()
    roundabouts = []
    for i in tqdm(range(len(modes))):
        env.generateWithRoadDefinition(
            roadDefinition=roadDefinition,
            outgoingLanesMerge=False,
            nSegments=nActions,
            laneToCircularId=modes[i]
        )
        roundabouts.append(env.getRoundabout())
    return roundabouts

def getRewards(roundabouts):
    rewards = [roundabout.getReward() for roundabout in roundabouts]
    return rewards

def getDiversityScore(roundabouts):
    distances = []
    for i in tqdm(range(len(roundabouts))):
        for j in range(i + 1, len(roundabouts)):
            distance = RewardUtil.getDistance(roundabouts[i], roundabouts[j])
            distances.append(distance)

    distances = np.array(distances)
    return distances.sum() / (len(roundabouts) * (len(roundabouts) - 1))
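
One detail of `getDiversityScore` is worth checking: the loop visits each unordered pair once, which gives n(n-1)/2 distances, while the sum is divided by n(n-1). The sketch below reproduces that arithmetic with a hypothetical distance function (absolute difference of numbers, standing in for `RewardUtil.getDistance`) and compares it with the plain mean pairwise distance.

```python
from itertools import combinations

def pairwise_scores(items, distance):
    """Return (score as normalized in getDiversityScore, mean pairwise distance)."""
    dists = [distance(a, b) for a, b in combinations(items, 2)]  # each unordered pair once
    n = len(items)
    as_written = sum(dists) / (n * (n - 1))
    mean_pairwise = sum(dists) / len(dists)
    return as_written, mean_pairwise

as_written, mean_pairwise = pairwise_scores([0.0, 1.0, 3.0], lambda a, b: abs(a - b))
print(as_written, mean_pairwise)  # 1.0 2.0
```

The score as written is half the mean pairwise distance. Comparisons between runs with the same normalization are unaffected; the factor only matters if the number is meant to be read as an average distance.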

In [163]:
scoresList = []
diversityScores = []
allRoundabouts = []
output = {"roundabouts" : [], "modes" : [], "proxyRewards" : []}
alpha = 0.4

for roadDefinition in roadDefs:
    env = SetGenerationEnv(size, nActions, roadDefinition)

    targetEntropy = -nActions
    log_alpha = torch.tensor([0.0], requires_grad=True)

    # models
    pi_model = PolicyNet(state_dim=size, nActions=nActions).to(device)
    q_origin_model1 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi1
    q_origin_model2 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi2
    q_target_model1 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi1'
    q_target_model2 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi2'
    _ = q_target_model1.requires_grad_(False)  # target model doesn't need grad
    _ = q_target_model2.requires_grad_(False)  # target model doesn't need grad
    buffer = replayBuffer(20000)

    # optimizers
    opt_pi = torch.optim.AdamW(pi_model.parameters(), lr=0.0005)
    opt_q1 = torch.optim.AdamW(q_origin_model1.parameters(), lr=0.0005)
    opt_q2 = torch.optim.AdamW(q_origin_model2.parameters(), lr=0.0005)
    alpha_optimizer = torch.optim.AdamW(params=[log_alpha], lr=0.0005) 
    doneTraining = False
    while not doneTraining:
        try:
            train(env, nIter=1700, batch_size=256)
            doneTraining = True
        except (ValueError, RuntimeError):
            print("Error , trying again.")
        
    rewardsWithConfigs = sampleRewardWithConfig(10**4, setEnv=env, pi_model=pi_model)
    modes, proxyRewards = getTopK(rewardsWithConfigs, 200)
    roundabouts = getRoundabouts(roadDefinition, modes)
    
    output["roundabouts"].append(roundabouts)
    output["modes"].append(modes)
    output["proxyRewards"].append(proxyRewards)
    print(log_alpha.exp())
    # rewards = getRewards(roundabouts)
    # scoresList.append(rewards)
import pickle
with open('analysis/expSAC_N=4_K=200.pkl', 'wb') as file:
    pickle.dump(output, file)

100%|███████████████████████████████████████████████████████████████████████████████████████| 1700/1700 [00:49<00:00, 34.23it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:21<00:00, 462.85it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:50<00:00,  3.97it/s]


tensor([1.], grad_fn=<ExpBackward0>)


(Output truncated: three similar progress bars, each group followed by `tensor([1.], grad_fn=<ExpBackward0>)`, repeat for the remaining 19 road definitions.)


In [164]:
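import numpy as np
# scoresList is still empty here (the append in the loop above is commented out),
# so the mean and std of the empty array are nan
scores = np.asarray(scoresList)
print(scores.mean(), "+-", scores.std())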

nan +- nan




(2, 200)

In [33]:
diversityScores = np.asarray(diversityScores)
print(diversityScores.mean(), "+-", diversityScores.std())

3.8119928845419757 +- 1.4471593257966016


In [160]:
from junctionart.roundabout.Roundabout import Roundabout
from tqdm import tqdm 
def getRewards(roundabouts):
    rewards = []
    for roundaboutList in tqdm(roundabouts):
        rewardList = [roundabout.getReward() for roundabout in roundaboutList]
        rewards.append(rewardList)
    return rewards

def getDiversityScore(roundabouts):
    distances = []
    for i in tqdm(range(len(roundabouts))):
        for j in range(i + 1, len(roundabouts)):
            distance = RewardUtil.getDistance(roundabouts[i], roundabouts[j])
            distances.append(distance)

    distances = np.array(distances)
    return distances.sum() / (len(roundabouts) * (len(roundabouts) - 1))

roundabouts = output['roundabouts']
proxyRewards = output['proxyRewards']
# roundabouts = [roundaboutList[:50] for roundaboutList in roundabouts]
# proxyRewards = [pList[:50] for pList in proxyRewards]

rewards = np.asarray(getRewards(roundabouts))
proxyRewards = np.asarray(proxyRewards)

print(rewards.mean(), "+-", rewards.std())
print(proxyRewards.mean(), "+-", proxyRewards.std())

100%|█████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.20s/it]

6.5475 +- 0.708162234237325
0.7975 +- 0.01561249499599601





In [176]:
roadDefinition = roadDefs[0]
env = SetGenerationEnv(size, nActions, roadDefinition)

targetEntropy = -nActions
# log_alpha = torch.tensor([0.0], requires_grad=True)
# alpha = log_alpha.exp().detach()
alpha = 0.1

# models
pi_model = PolicyNet(state_dim=size, nActions=nActions).to(device)
q_origin_model1 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi1
q_origin_model2 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi2
q_target_model1 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi1'
q_target_model2 = QNet(state_dim=size, nActions=nActions).to(device)  # Q_phi2'
_ = q_target_model1.requires_grad_(False)  # target model doesn't need grad
_ = q_target_model2.requires_grad_(False)  # target model doesn't need grad
buffer = replayBuffer(20000)

# optimizers
opt_pi = torch.optim.AdamW(pi_model.parameters(), lr=0.0005)
opt_q1 = torch.optim.AdamW(q_origin_model1.parameters(), lr=0.0005)
opt_q2 = torch.optim.AdamW(q_origin_model2.parameters(), lr=0.0005)
# alpha_optimizer = torch.optim.AdamW(params=[log_alpha], lr=0.0005) 
doneTraining = False

train(env, nIter=6000, batch_size=256, disableBar=True)
        
    


Run episode5999 with rewards 5.011872336272722 s [9.0, 16.0, 26.0, 26.0] ALPHA 0.111

In [184]:
states = []


for i in range(100):
    s = torch.zeros(1, size)
    done = False
    cum_reward = 0
    while not done:
        a = pick_sample((s/nActions).squeeze().tolist(), pi_model)

        s_next = env.update(s, torch.tensor([a]), inPlace = False)

        done = (s_next != 0).all().item()
        if done:
            config = (s_next.squeeze() - 1).long().tolist()
            r = 10**env.getProxyReward(config, normalize=True)

        else:
            r = 0
        cum_reward += r
        s = s_next

    # print(f"state {s.squeeze().tolist()} reward {cum_reward}")
    states.append((s.squeeze().tolist(), cum_reward))

In [185]:
def getNumberOfStates(samples):
    sampleCnt = {}
    for sample, reward in samples:
        if str(sample) in sampleCnt:
            sampleCnt[str(sample)] += 1
        else:
            sampleCnt[str(sample)] = 1
    return sampleCnt

# samples = sample(agent, 100)
len(getNumberOfStates(states))

91