## Tutorial 3: Developing an original *Agent* with DRL
This tutorial demonstrates how to develop an *Agent* driven by a DRL algorithm, using ***KSPDRLAgent***. 

The *Agent* base classes are as follows: 

- `Agent` (used in **Tutorial 2**)
- `KSPAgent` (used in **Tutorial 2**)
- `PrioritizedKSPAgent` (used in **Tutorial 2**)
- `KSPDRLAgent`

In [1]:
!pip install git+https://github.com/Optical-Networks-Group/rsa-rl.git

## Evaluation Settings
For evaluation, prepare an *Environment* and an evaluation function. 
Please see **Tutorial 1** if you have not gone through it yet. 

In [2]:
import functools
import numpy as np

from rsarl.envs import DeepRMSAEnv, make_multiprocess_vector_env
from rsarl.requester import UniformRequester
from rsarl.networks import SingleFiberNetwork
from rsarl.evaluator import batch_warming_up, batch_evaluation, batch_summary

In [3]:
# Set the device id to use GPU. To use CPU only, set it to -1.
gpu = 0

In [16]:
# exp settings
n_requests = 10000
n_envs, seed = 5, 0

# build network
net = SingleFiberNetwork("nsf", n_slot=100, is_weight=True)
# build requester
requester = UniformRequester(
    net.n_nodes,
    avg_service_time=10,
    avg_request_arrival_rate=12)
# build env
env = DeepRMSAEnv(net, requester)
# envs for training and evaluation
envs = make_multiprocess_vector_env(env, n_envs, seed, test=False)
test_envs = make_multiprocess_vector_env(env, n_envs, seed, test=True)
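
The requester settings above determine how heavily the network is loaded. As a rough sketch, assuming `avg_request_arrival_rate` and `avg_service_time` are expressed in the same time unit (an assumption, the tutorial does not state the units), the offered traffic load in Erlang is simply their product:

```python
# Offered load (Erlang) = mean arrival rate x mean holding time.
# Assumption: both parameters use the same time unit.
avg_service_time = 10
avg_request_arrival_rate = 12

offered_load = avg_request_arrival_rate * avg_service_time
print(f"Offered load: {offered_load} Erlang")
```

Raising either parameter increases the load, and a higher load generally leads to a higher blocking probability for the same number of slots.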

In [17]:
def _evaluation(envs, agent, n_requests): 
    # start simulation
    envs.reset()
    # warm up the network before measuring
    batch_warming_up(envs, agent, n_requests=3000)
    # evaluation
    experiences = batch_evaluation(envs, agent, n_requests=n_requests)
    # calc performance
    blocking_probs, avg_utils, total_rewards = batch_summary(experiences)

    for env_id, (blocking_prob, avg_util, total_reward) in enumerate(zip(blocking_probs, avg_utils, total_rewards)):
        print(f'[{env_id}-th ENV]Blocking Probability: {blocking_prob}')
        print(f'[{env_id}-th ENV]Avg. Slot-utilization: {avg_util}')
        print(f'[{env_id}-th ENV]Total Rewards: {total_reward}')

# evaluation with test environments
evaluation = functools.partial(_evaluation, envs=test_envs, n_requests=n_requests)
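
The `functools.partial` call above binds `envs` and `n_requests` in advance, so only `agent` remains to be supplied later. The following minimal sketch shows the mechanism with a stand-in function (`report` and the string values are placeholders for illustration, not part of *RSA-RL*):

```python
import functools

def report(envs, agent, n_requests):
    # Stand-in for `_evaluation`: just return what was received.
    return envs, agent, n_requests

# Bind `envs` and `n_requests` now and leave `agent` open.
run = functools.partial(report, envs="test_envs", n_requests=10000)

# `agent` has to be passed by keyword: a positional argument would be
# assigned to `envs`, which is already bound, and raise a TypeError.
result = run(agent="my_agent")
print(result)
```

This is why the training loop later calls `evaluation(agent=agent)` with a keyword argument.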

## Step 1: Select DRL algorithm from PFRL
*RSA-RL* assumes that the DRL algorithm is provided by the [PFRL](https://github.com/pfnet/pfrl) library. 
***PFRL*** is a DRL library that implements various state-of-the-art deep reinforcement learning algorithms in Python using [PyTorch](https://github.com/pytorch/pytorch).  
Its discrete-action algorithms are as follows: 

- ***DQN(Double DQN)***
- ***Rainbow***
- ***IQN***
- ***A3C***, ***A2C***
- ***ACER***
- ***PPO***
- ***TRPO***

In this tutorial, we try to reproduce the prior work [DeepRMSA](https://ieeexplore.ieee.org/document/8386173), which applies DRL to a ***routing algorithm*** that selects one of the *k* shortest paths. 
This tutorial calls it ***DeepRMSAv1*** and implements it with ***Double DQN (DDQN)***. 
When using DDQN, there are three steps:

1. Build a deep neural network (DNN) model
2. Specify ***Explore*** and ***Replay Buffer***, e.g., epsilon greedy and prioritized replay buffer, respectively
3. Build DDQN

First, develop a DNN whose number of outputs is *k*. 

In [6]:
import pfrl
import torch
import torch.nn as nn

In [7]:
class DeepRMSAv1_DNN(torch.nn.Module):

    def __init__(self, SLOT: int, ICH: int, K: int, n_edges: int):
        super().__init__()
        self.SLOT = SLOT
        # CNN
        self.conv = nn.Sequential(*[
            nn.Conv2d(ICH, 1, kernel_size=(1,1), stride=(1, 1)),
            nn.ReLU(),
            # 2 conv layers with 16 filters
            nn.Conv2d(1, 16, kernel_size=(n_edges,1), stride=(1, 1)),
            nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=(1,1), stride=(1, 1)),
            nn.ReLU(),
            # 2 depthwise conv layers (groups=16: one filter per channel)
            nn.ZeroPad2d((1, 0, 0, 0)), # left, right, top, bottom
            nn.Conv2d(16, 16, kernel_size=(1,2), stride=(1, 1), groups=16),
            nn.ReLU(),
            nn.ZeroPad2d((1, 0, 0, 0)),
            nn.Conv2d(16, 16, kernel_size=(1,2), stride=(1, 1), groups=16),
            nn.ReLU(),
        ])
        # fc
        self.fc = nn.Sequential(*[
            nn.Linear(SLOT*16, 128),
            nn.ReLU(),
            nn.Linear(128, 50),
            nn.ReLU(),
            nn.Linear(50, K),
        ])      

    def forward(self, x):
        h = x
        h = self.conv(h)
        h = h.view(-1, self.SLOT*16)
        h = self.fc(h)
        return pfrl.action_value.DiscreteActionValue(h)
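
To see why the first fully connected layer takes `SLOT*16` inputs, it helps to trace the tensor shape through the convolution stack. The sketch below uses the standard output-size formula for a convolution without dilation. The value `n_edges=22` is only an illustrative placeholder (the tutorial does not state the edge count of the `"nsf"` network), and `trace_shapes` is a hypothetical helper, not part of *RSA-RL* or PyTorch.

```python
def conv_out(size: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    """Output length along one axis; `pad` is the total padding added."""
    return (size + pad - kernel) // stride + 1

def trace_shapes(n_edges: int, n_slot: int):
    """Return (channels, height, width) after each conv layer of the model."""
    shapes = []
    h, w = n_edges, n_slot
    # 1x1 conv: ICH channels -> 1 channel, spatial size unchanged
    h, w = conv_out(h, 1), conv_out(w, 1)
    shapes.append((1, h, w))
    # (n_edges x 1) conv: collapses the edge axis to 1, 16 channels
    h, w = conv_out(h, n_edges), conv_out(w, 1)
    shapes.append((16, h, w))
    # 1x1 conv: shape unchanged
    shapes.append((16, h, w))
    # twice: pad one column on the left, then a (1 x 2) depthwise conv
    for _ in range(2):
        w = conv_out(w, 2, pad=1)
        shapes.append((16, h, w))
    return shapes

shapes = trace_shapes(n_edges=22, n_slot=100)  # n_edges is a placeholder
channels, height, width = shapes[-1]
flat_features = channels * height * width
print(shapes[-1], flat_features)
```

Because the left padding compensates for the width-2 kernel, the slot axis keeps its length, and the flattened feature size equals `16 * n_slot` regardless of the number of edges.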

In [8]:
# Experimental Settings
K = 5
# slot-table(1) + one-hot-node * 2 + bandwidth(1)
ICH = 1 + 2 * net.n_nodes + 1
# build DNN for Q-function
q_func = DeepRMSAv1_DNN(net.n_slot, ICH, K, net.n_edges)
# Specify optimizer 
optimizer = torch.optim.Adam(q_func.parameters(), eps=1e-2)

### Specify *Explore* and *Replay Buffer*
This tutorial selects `ConstantEpsilonGreedy` for exploration and a plain `ReplayBuffer`. 
If you want to use others, please refer to *PFRL*'s documentation:
- [explorers](https://pfrl.readthedocs.io/en/latest/explorers.html)
- [replay buffers](https://pfrl.readthedocs.io/en/latest/replay_buffers.html)

In [9]:
def _action_sampler(k):
    return np.random.randint(0, k)

# random action function
action_sampler = functools.partial(_action_sampler, k=K)

In [10]:
# Set the discount factor that discounts future rewards.
gamma = 0.99

# Use epsilon-greedy for exploration
explorer = pfrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.1, random_action_func=action_sampler)

# DQN uses Experience Replay.
# Specify a replay buffer and its capacity.
replay_buffer = pfrl.replay_buffers.ReplayBuffer(capacity=10 ** 6, num_steps=50)
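
For intuition, constant epsilon-greedy exploration can be sketched in a few lines of plain Python. This is an illustrative re-implementation under the usual definition of the technique, not PFRL's code. The function `epsilon_greedy` and the Q-values are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, random_action_func):
    """With probability epsilon take a random action, else the greedy one."""
    if random.random() < epsilon:
        return random_action_func()
    # Greedy action: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

K = 5
q_values = [0.1, 0.7, 0.3, 0.2, 0.0]  # hypothetical Q-values for the K paths

def sampler():
    # Same role as `action_sampler` above: a uniform random path index
    return random.randrange(K)

greedy_action = epsilon_greedy(q_values, 0.0, sampler)  # never explores
random_action = epsilon_greedy(q_values, 1.0, sampler)  # always explores
print(greedy_action, random_action)
```

With `epsilon=0.1`, as configured above, roughly one decision in ten is a uniformly random path among the *k* candidates.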

### Build DDQN
NOTE: the DeepRMSA paper does not give enough information about its hyperparameters, 
so we cannot reproduce it exactly. 

In [11]:
# Now create an agent that will interact with the environment.
DDQN = pfrl.agents.DoubleDQN(
    q_func,
    optimizer,
    replay_buffer,
    gamma,
    explorer,
    minibatch_size=50,
    update_interval=1,
    replay_start_size=500,
    target_update_interval=100,
    gpu=gpu,
)
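
What distinguishes Double DQN from plain DQN is how the learning target is formed: the online network selects the next action and the target network evaluates it. The following is a minimal one-step sketch of that rule in plain Python. It is not PFRL's implementation, it ignores the multi-step returns implied by `num_steps=50` above, and the function name and numbers are hypothetical.

```python
def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """One-step Double DQN target for a single transition."""
    if done:
        return reward
    # The online network selects the next action ...
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... and the target network evaluates that action.
    return reward + gamma * q_target_next[best]

y = double_dqn_target(
    reward=1.0, done=False, gamma=0.99,
    q_online_next=[0.2, 0.9, 0.1],   # hypothetical online Q(s', .)
    q_target_next=[0.5, 0.3, 0.8],   # hypothetical target Q(s', .)
)
print(y)
```

In this example plain DQN would bootstrap from the largest target value (0.8), while Double DQN uses the target value of the action preferred by the online network (0.3), which tends to reduce overestimation.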

## Step 2: Develop your algorithm by using *KSPDRLAgent*
*RSA-RL* provides ***KSPDRLAgent***, which is based on the *KSPAgent* class, so the ***k-shortest path table*** is available.   
You need to override two methods: 
- `preprocess`: creates a *feature vector* from an *observation*
- `map_drlout_to_action`: maps the output of the DRL algorithm to an *Action*

In [12]:
import numpy as np
import networkx as nx
from rsarl.data import Action
from rsarl.agents import KSPDRLAgent
from rsarl.utils import cal_slot, sort_tuple
from rsarl.algorithms import SpectrumAssignment

In [13]:
def vectorize(n_nodes: int, node_id: int):
    mp = np.eye(n_nodes, dtype=np.float32)[node_id].reshape(-1, 1, 1)
    return mp

class DRLAgent(KSPDRLAgent):

    def preprocess(self, obs):
        """Build the feature tensor (ICH, #edges, #slots) from an observation."""
        net = obs.net
        source, destination, bandwidth, duration = obs.request
        # slot table
        whole_slot = np.array(list(nx.get_edge_attributes(net.G, name="slot").values()))
        whole_slot = whole_slot.reshape(1, net.n_edges, net.n_slot).astype(np.float32)
        # source, destination, bandwidth map
        smap = np.ones_like(whole_slot) * vectorize(net.n_nodes, source)
        dmap = np.ones_like(whole_slot) * vectorize(net.n_nodes, destination)
        bmap = np.ones_like(whole_slot) * bandwidth
        # concatenate along the channel axis: (ICH, #edges, #slots)
        fvec = np.concatenate([whole_slot, smap, dmap, bmap], axis=0)
        return fvec.astype(np.float32, copy=False)

    def map_drlout_to_action(self, obs, out):
        net = obs.net
        s, d, bandwidth, duration = obs.request
        paths = self.path_table[sort_tuple((s, d))]
        # map
        path = paths[out]

        # required slots
        path_len = net.distance(path)
        n_req_slot = cal_slot(bandwidth, path_len)
        # first-fit (FF) spectrum assignment
        path_slot = net.path_slot(path)
        slot_index = SpectrumAssignment.first_fit(path_slot, n_req_slot)
        if slot_index is None:
            return None
        else:
            return Action(path, slot_index, n_req_slot, duration)
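
`map_drlout_to_action` relies on first-fit spectrum assignment and returns `None` when no block of slots is available. The idea can be sketched as a standalone function. This is not the code of `SpectrumAssignment.first_fit`, and it assumes that `1` marks a free slot and `0` an occupied one, which the tutorial does not specify.

```python
from typing import Optional, Sequence

def first_fit(path_slot: Sequence[int], n_req_slot: int) -> Optional[int]:
    """Lowest start index of `n_req_slot` contiguous free slots, or None.

    Assumption: 1 = free slot, 0 = occupied slot; n_req_slot >= 1.
    """
    run = 0  # length of the current run of free slots
    for i, free in enumerate(path_slot):
        run = run + 1 if free else 0
        if run == n_req_slot:
            return i - n_req_slot + 1
    return None

slots = [0, 1, 1, 0, 1, 1, 1, 0]
start_for_two = first_fit(slots, 2)
start_for_three = first_fit(slots, 3)
start_for_four = first_fit(slots, 4)
print(start_for_two, start_for_three, start_for_four)
```

The last call finds no run of four free slots, which corresponds to the blocked case in which the agent returns `None`.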

In [14]:
agent = DRLAgent(k=5, drl=DDQN)
# prepare path table
agent.prepare_ksp_table(net)
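
To illustrate what a *k*-shortest path table contains, the sketch below builds one for a toy graph by enumerating all loop-free paths and keeping the *k* shortest per node pair. It is a brute-force illustration only, not how `prepare_ksp_table` is implemented. The topology, weights, and helper names are hypothetical, and keying the table by the sorted node pair is an assumption suggested by the `sort_tuple((s, d))` lookup above.

```python
from itertools import combinations

def simple_paths(adj, src, dst, path=None):
    """Yield every loop-free path from src to dst (depth-first search)."""
    path = [src] if path is None else path
    if src == dst:
        yield list(path)
        return
    for nxt in adj[src]:
        if nxt not in path:
            path.append(nxt)
            yield from simple_paths(adj, nxt, dst, path)
            path.pop()

def build_ksp_table(adj, weights, k):
    """Map each sorted node pair to its k shortest paths by total weight."""
    def length(p):
        return sum(weights[frozenset(e)] for e in zip(p, p[1:]))
    table = {}
    for s, d in combinations(sorted(adj), 2):
        table[(s, d)] = sorted(simple_paths(adj, s, d), key=length)[:k]
    return table

# Toy 4-node ring with one long chord (hypothetical topology and weights)
adj = {0: [1, 3], 1: [0, 2, 3], 2: [1, 3], 3: [0, 1, 2]}
weights = {frozenset(e): w for e, w in
           [((0, 1), 1), ((1, 2), 1), ((2, 3), 1), ((0, 3), 1), ((1, 3), 3)]}
table = build_ksp_table(adj, weights, k=2)

# Look up a request from node 2 to node 0 via the sorted pair
paths = table[tuple(sorted((2, 0)))]
print(paths)
```

The DRL output is then just an index into such a list of candidate paths, which is what `paths[out]` does in `map_drlout_to_action`.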

## Step 3: Training and Evaluate *DRL Agent*
Finally, let's train and evaluate the agent! 
Interaction between the *Agent* and the *Environment* automatically trains the *Agent*.  
NOTE: before evaluation, switch the DRL model to ***evaluation mode*** with the `eval_mode` method so that exploration does not run. 
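
An evaluation-mode switch of this kind is typically a context manager that turns a training flag off and restores it on exit. The sketch below shows the pattern with a toy class. `ToyAgent` is hypothetical and only mirrors the behaviour described here, it is not PFRL's agent class.

```python
import contextlib

class ToyAgent:
    """Minimal stand-in for a DRL agent with a training flag."""

    def __init__(self):
        self.training = True

    @contextlib.contextmanager
    def eval_mode(self):
        # Switch to evaluation, then restore the previous mode on exit,
        # even if an exception is raised inside the `with` block.
        previous = self.training
        try:
            self.training = False
            yield
        finally:
            self.training = previous

toy = ToyAgent()
with toy.eval_mode():
    inside = toy.training   # evaluation: no exploration, no updates
after = toy.training        # training mode is restored
print(inside, after)
```

Using a `with` block keeps the periodic evaluation inside the training loop from accidentally leaving the agent in evaluation mode.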

In [None]:
# Batch act
obses = envs.reset()
resets = [False for _ in range(len(obses))]
for train_cnt in range(200000):
    acts = agent.batch_act(obses)
    obses, rews, dones, infos = envs.step(acts)
    agent.batch_observe(obses, rews, dones, resets)

    # Build the mask `not_end`: False where an episode ended (that env is reset), True otherwise
    not_end = np.logical_not(dones)
    obses = envs.reset(not_end)
    
    if train_cnt % 20000 == 0:
        print(f'[{train_cnt}-th EVAL]')
        test_envs.reset()
        with agent.drl.eval_mode():
            evaluation(agent=agent)

[0-th EVAL]
[0-th ENV]Blocking Probability: 64.81
[0-th ENV]Avg. Slot-utilization: 0.4237841818181818
[0-th ENV]Total Rewards: -2962.0
[1-th ENV]Blocking Probability: 63.93
[1-th ENV]Avg. Slot-utilization: 0.4230850909090909
[1-th ENV]Total Rewards: -2786.0
[2-th ENV]Blocking Probability: 64.38000000000001
[2-th ENV]Avg. Slot-utilization: 0.4272641363636363
[2-th ENV]Total Rewards: -2876.0
[3-th ENV]Blocking Probability: 63.72
[3-th ENV]Avg. Slot-utilization: 0.41932254545454545
[3-th ENV]Total Rewards: -2744.0
[4-th ENV]Blocking Probability: 64.25
[4-th ENV]Avg. Slot-utilization: 0.4264771363636363
[4-th ENV]Total Rewards: -2850.0
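
The three printed metrics are related. The numbers above are consistent with a reward of +1 for each accepted request and -1 for each blocked one, with the blocking probability given in percent. This reward scheme is inferred from the output, not stated in the tutorial. A quick check for the 0-th environment:

```python
n_requests = 10000
blocking_prob_percent = 64.81   # value printed for the 0-th ENV

blocked = round(n_requests * blocking_prob_percent / 100)
accepted = n_requests - blocked
# Assumption: +1 per accepted request, -1 per blocked request
total_reward = accepted - blocked
print(blocked, accepted, total_reward)
```

The result matches the printed total reward of the 0-th environment, and the same relation holds for the other four. A blocking probability above 60% at step 0 simply reflects an untrained agent.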


## Conclusion

That's all! 
This tutorial demonstrated how to develop a DRL *Agent*. 
The next tutorial demonstrates how to develop your own ***Environment***. 