# Second Scenario - Scaling

In [2]:
%load_ext tensorboard

# Objectives
The previous prototype performed well on smaller problems, but its performance degraded quickly on larger ones. This scenario aims to address that problem.

# State
Agents know:

- The positions of the closest _i_ agents
- The positions of the closest _j_ sensors that have not been visited yet
- Their own position

Because only _i_ agents and _j_ sensors are visible, the size of the state no longer scales linearly with the number of drones or sensors; it is constant. Also, since only unvisited sensors appear in the state, agents no longer need to know which sensors have been visited.
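As a rough illustration of why the state has a fixed size, the sketch below builds a flat observation from the closest _i_ agents, the closest _j_ unvisited sensors, and the agent's own position. The function name `build_observation`, the 2D tuple positions, and the padding value used when fewer than _i_ agents or _j_ sensors remain are all assumptions made for this example, not details taken from the prototype.

```python
import math

def build_observation(own_pos, other_agents, sensors, visited, i, j, pad=(0.0, 0.0)):
    """Return a flat observation of constant length 2 * (i + j + 1).

    own_pos      -- (x, y) of the observing agent
    other_agents -- list of (x, y) positions of the other agents
    sensors      -- list of (x, y) sensor positions
    visited      -- list of bools, True where the sensor was already visited
    pad          -- hypothetical filler used when fewer than i or j entries exist
    """
    def closest(points, k):
        # Sort by Euclidean distance to the observing agent and keep the first k.
        ranked = sorted(points, key=lambda p: math.dist(own_pos, p))[:k]
        # Pad so the observation length never depends on how many entities exist.
        return ranked + [pad] * (k - len(ranked))

    # Only sensors that have not been visited yet are visible.
    unvisited = [p for p, seen in zip(sensors, visited) if not seen]

    obs = []
    for point in closest(other_agents, i) + closest(unvisited, j) + [own_pos]:
        obs.extend(point)
    return obs
```

With `i=2` and `j=2` the observation always has 10 values, regardless of how many drones or sensors the scenario contains.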

# Reward
- Number of sensors visited for the first time in the last iteration
- -1 to an agent that leaves the scenario's area

This new reward is intended to address the sparsity of the previous one. Rewards are more frequent: one is given every time a sensor is visited for the first time, instead of once at the end of the episode. It is also more forgiving, as agents are rewarded even if not all sensors have been visited by the end of the episode. This change is intended to accelerate learning.
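The reward rules above could be sketched as a small per-agent function. This is only an illustration: the name `agent_reward` is hypothetical, and two details are assumptions that the text does not settle, namely that each agent is rewarded for the sensors it visited itself and that the -1 penalty replaces the visit reward rather than being added to it.

```python
def agent_reward(new_sensors_visited, left_area):
    """Reward for one agent over one iteration.

    new_sensors_visited -- sensors this agent visited for the first time
                           in the last iteration (assumed per-agent count)
    left_area           -- True if the agent left the scenario's area
    """
    if left_area:
        # Assumption: the penalty replaces any visit reward for this iteration.
        return -1.0
    # Dense reward: one point per newly visited sensor, zero otherwise.
    return float(new_sensors_visited)
```

Because the reward is non-zero whenever a new sensor is reached, an episode that visits only some of the sensors still produces a useful learning signal.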

# Training
While a policy is being learned during training, it is important that a diverse set of experiences is collected, containing both successes and failures. A large quantity of experiences is not enough; they also have to be meaningful. In the previous scenario, if the agents could not visit all sensors and did not leave the simulation area, the episode would end at a parametrized time limit. Since the behavior of not leaving the scenario's area is learned quickly, but the data collection behavior is not, simulations would very frequently run until the time limit. This means that a lot of meaningless data was being collected while the agents wandered around the scenario until the time limit.

To help with this, the termination condition was changed. Instead of terminating the simulation at a parametrized time limit, a stall counter was implemented. It counts the number of seconds since a new sensor was last visited and resets to zero every time a new sensor is visited. A new parameter specifies the maximum number of stalled seconds; once it is reached, the episode is terminated. With this strategy, simulations where meaningful data is being collected (sensors are being visited) are allowed to continue, whereas simulations where the agents are "confused" can be terminated early. This stall limit can be set much lower than the time limit.
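A minimal sketch of such a stall counter is shown below. The class name `StallCounter`, the parameter name `max_stalled_seconds`, and the `dt` argument (simulated seconds per iteration) are hypothetical names chosen for this example.

```python
class StallCounter:
    """Tracks simulated seconds elapsed since a new sensor was last visited."""

    def __init__(self, max_stalled_seconds):
        self.max_stalled_seconds = max_stalled_seconds
        self.stalled_seconds = 0.0

    def update(self, new_sensors_visited, dt):
        """Advance the counter by one iteration of dt seconds.

        Returns True when the episode should be terminated.
        """
        if new_sensors_visited > 0:
            # Progress was made: reset the counter.
            self.stalled_seconds = 0.0
        else:
            self.stalled_seconds += dt
        return self.stalled_seconds >= self.max_stalled_seconds
```

In an episode loop, `update` would be called once per iteration, and the episode would end as soon as it returns `True`, when all sensors are visited, or when an agent leaves the area.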

# Algorithm
The DDPG algorithm performs one training iteration per episode iteration. Since we are using multiple agents, multiple experiences are collected at every iteration. Previously we trained once per episode iteration, but we should train once for every experience collected, to match the experience-to-training ratio of the regular DDPG algorithm.
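The adjusted schedule can be sketched as follows. The names `store_and_train`, `replay_buffer`, and `train_step` are placeholders for this example: `train_step` stands in for whatever function performs one DDPG update from the buffer, and is not an actual library call.

```python
def store_and_train(experiences, replay_buffer, train_step):
    """Store every experience from one iteration, then train once per experience.

    experiences   -- list with one experience per agent for this iteration
    replay_buffer -- list-like buffer that experiences are appended to
    train_step    -- hypothetical callable performing one DDPG update
    Returns the number of training steps performed.
    """
    for experience in experiences:
        replay_buffer.append(experience)

    # One update per collected experience keeps the ratio of experiences to
    # training steps at 1:1, as in single-agent DDPG.
    for _ in experiences:
        train_step(replay_buffer)
    return len(experiences)
```

With three agents, each episode iteration therefore stores three experiences and triggers three updates, instead of one.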

# Sensor number evaluation
To evaluate how these changes have affected results, we will run a campaign evaluating the performance on 

## Parameters
- Number of drones = 1
- Number of sensors = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
- Scenario square size = 100
- Training time = 10mil

## Results

In [4]:
%tensorboard --logdir runs/sensors
