# 2: Random Warmup

In this notebook, we explore the results generated by the `2_randomwarmup.sh` script, where we reduce the total number of episodes from 50k (see `1_bruteforce.ipynb`) to 20k and have the agents follow a random policy for 10k _steps_ before their actual training starts.
This way, they can fill up their buffer(s) with random (i.e., unbiased) experiences before they start learning from them.

The idea is that this should help the agents overcome the initial bias they seem to suffer from during roughly the first 20k episodes.
If we see any improvements to their rewards and/or action biases, we can infer that using a random warmup is a good idea.

The configurations that we explore are largely similar to the brute force experiments.
The only difference is that, for the agents using TN, the target network update interval is reduced from 2.5k to 1k.
This way, the target network is still updated just as many times as in the brute force experiments.
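
The arithmetic behind this can be checked with a small sketch. It assumes that the 2.5k and 1k values are counted in the same unit as the run length (episodes); the dictionary and its keys are purely illustrative and not part of the `dql` package.

```python
# Assumption: the target network update setting is expressed in the same
# unit as the total run length, so the number of updates is a plain division.
settings = {
    'brute force':   {'length': 50_000, 'interval': 2_500},
    'random warmup': {'length': 20_000, 'interval': 1_000},
}

# Number of target network updates over one full run, per experiment.
updates = {name: s['length'] // s['interval'] for name, s in settings.items()}
print(updates)  # {'brute force': 20, 'random warmup': 20}
```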

The other settings are also pretty much the same, but with these adjustments:

| parameter                     | previous value | new value
|-------------------------------|----------------|----------
| number of individual runs     | 6              | 5
| number of episodes            | 50k            | 20k
| number of random warmup steps | 0              | 10k

In addition to the above, we also explore the difference between two distinct annealing schemes for the $\varepsilon$-greedy exploration strategy.

The first one (scheme 1) is the one we used in the brute force experiments, where the $\varepsilon$ value is annealed exponentially from 1 to 0.01 over the first 80% of the total number of episodes.
The second one (scheme 4) is a slightly less aggressive annealing scheme, where the $\varepsilon$ value is annealed linearly from 1 to 0.1 over the first 50% of the total number of episodes.

| scheme | from | to   | window | kind
|--------|------|------|--------|------
| 0      | 1    | 1    | 0      | -
| 1      | 1    | 0.01 | 80%    | exponential
| 2*     | 1    | 0.01 | 80%    | linear
| 3*     | 1    | 0.1  | 50%    | exponential
| 4      | 1    | 0.1  | 50%    | linear

\* Schemes 2 and 3 are not examined in this project, but are included in the `agent.annealing` module for possible later exploration.
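
To make the table concrete, here is a minimal sketch of how such schedules could be computed. It is not the code from the `agent.annealing` module: the function `epsilon_schedule` and its parameters are hypothetical, and "exponential" is assumed to mean geometric interpolation between the start and end values, which is one common choice.

```python
import numpy as np

def epsilon_schedule(n_episodes, start, end, window, kind):
    """Return one epsilon value per episode (illustrative sketch only)."""
    n_anneal = int(window * n_episodes)      # episodes spent annealing
    eps = np.full(n_episodes, float(end))    # after the window, hold the end value
    if n_anneal == 0:
        return eps
    frac = np.arange(n_anneal) / n_anneal    # progress through the window, in [0, 1)
    if kind == 'linear':
        eps[:n_anneal] = start + (end - start) * frac
    elif kind == 'exponential':
        # assumed form: geometric interpolation from start to end
        eps[:n_anneal] = start * (end / start) ** frac
    else:
        raise ValueError(f'unknown kind: {kind}')
    return eps

eps0 = epsilon_schedule(20_000, 1.0, 1.0, 0.0, 'linear')        # scheme 0: constant
eps1 = epsilon_schedule(20_000, 1.0, 0.01, 0.8, 'exponential')  # scheme 1
eps4 = epsilon_schedule(20_000, 1.0, 0.1, 0.5, 'linear')        # scheme 4
```

Under these assumptions, scheme 1 has already dropped to about 0.1 halfway through its window (episode 8k), whereas scheme 4 is still at 0.55 halfway through its window (episode 5k) and never goes below 0.1.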

## Preliminaries

In [None]:
import os
from pathlib import Path

from dql.utils.namespaces import P
from dql.utils.datamanager import ConcatDataManager
from dql.utils.plotter import ColorPlot, LossPlot, ComparisonPlot

import numpy as np
import matplotlib.pyplot as plt

Check if we have the data.

There should be BL, ER, TN, and TR runs for both annealing schemes (`AA1` and `AA4`).

In [None]:
runIDs = [f for f in os.listdir(P.data) if f.startswith('AA')]
print('\n'.join(runIDs))

Check if the parameters are correct.
We check for the run using the `TR` config, since it will contain all the hyperparameters.
For the first annealing scheme, we print the full summary.

In [None]:
ConcatDataManager('AA1-TR').printSummary()

The `AA4` run only differs in annealing scheme, so we load and print this separately.

In [None]:
for k, v in ConcatDataManager('AA4-TR').loadSummary().params.annealingScheme.items():
    print(f'{k}: {v}')

## Plotting

Define a function to easily get all figures for a given run.

In [None]:
runNames = {'BL': 'Baseline', 'ER': 'Experience Replay', 'TN': 'Target Network', 'TR': 'Target Network + Experience Replay'}

def getFigs(runID: str, exp: int) -> tuple[plt.Figure, plt.Figure, plt.Figure]:
    # exp selects the annealing scheme: 1 -> 'AA1', anything else -> 'AA4'
    expID, expName = ('AA1', 'Annealing Scheme 1') if exp == 1 else ('AA4', 'Annealing Scheme 4')

    title = f'| {runNames[runID]}\n({expName})'
    DM = ConcatDataManager(f'{expID}-{runID}')

    R = DM.loadRewards()
    fR = ColorPlot(R, label='reward', title=title).getFig()

    A = DM.loadActions()
    AB = np.abs((A / np.sum(A, axis=2, keepdims=True))[:, :, 0] - .5) * 2
    fAB = ColorPlot(AB, label='action bias', title=title).getFig()

    L = DM.loadLosses()
    fL = LossPlot(L, title=title).getFig()
    return fR, fAB, fL
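
The action bias used above can be illustrated on a toy array. This sketch assumes that `loadActions()` returns action counts with shape (runs, episodes, actions) and that there are two actions; the numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical action counts: 1 run, 3 episodes, 2 actions.
A = np.array([
    [[50, 50],    # both actions chosen equally often
     [100, 0],    # only the first action chosen
     [25, 75]],   # second action chosen three times as often
])

# Fraction of steps in which the first action was chosen, per run and episode.
p0 = (A / np.sum(A, axis=2, keepdims=True))[:, :, 0]

# Rescale the distance from 0.5 so that 0 means balanced and 1 means one-sided.
action_bias = np.abs(p0 - .5) * 2
print(action_bias)
```

The three episodes yield a bias of 0, 1, and 0.5 respectively, so the measure ignores which action is favored and only reports how strongly.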

---
### Baseline

In [None]:
runID = 'BL'
rewardFig, actionBiasFig, lossFig = getFigs(runID, 1)
rewardFig.savefig(Path(P.plots) / f'AA1-{runID}-R.png', dpi=500, bbox_inches='tight')

In [None]:
rewardFig, actionBiasFig, lossFig = getFigs(runID, 4)

---
### Experience Replay

In [None]:
runID = 'ER'
rewardFig, actionBiasFig, lossFig = getFigs(runID, 1)

In [None]:
rewardFig, actionBiasFig, lossFig = getFigs(runID, 4)

---
### Target Network

In [None]:
runID = 'TN'
rewardFig, actionBiasFig, lossFig = getFigs(runID, 1)

In [None]:
rewardFig, actionBiasFig, lossFig = getFigs(runID, 4)

---
### Target Network + Experience Replay

In [None]:
runID = 'TR'
rewardFig, actionBiasFig, lossFig = getFigs(runID, 1)

In [None]:
rewardFig, actionBiasFig, lossFig = getFigs(runID, 4)
rewardFig.savefig(Path(P.plots) / f'AA4-{runID}-R.png', dpi=500, bbox_inches='tight')

---
### Comparison

In [None]:
data1 = []
data4 = []
# redefine runIDs to get the correct order
runIDs = ['BL', 'ER', 'TN', 'TR']
for runID in runIDs:
    DM1 = ConcatDataManager(f'AA1-{runID}')
    DM4 = ConcatDataManager(f'AA4-{runID}')
    R1, R4 = DM1.loadRewards(), DM4.loadRewards()
    A1, A4 = DM1.loadActions(), DM4.loadActions()
    AB1 = np.abs((A1 / np.sum(A1, axis=2, keepdims=True))[:, :, 0] - .5) * 2
    AB4 = np.abs((A4 / np.sum(A4, axis=2, keepdims=True))[:, :, 0] - .5) * 2
    data1.append((R1, AB1))
    data4.append((R4, AB4))

In [None]:
fig1 = ComparisonPlot(data1, runIDs, 'Annealing Scheme 1').getFig()
fig1.savefig(Path(P.plots) / 'AA1-C.png', dpi=500, bbox_inches='tight')
fig4 = ComparisonPlot(data4, runIDs, 'Annealing Scheme 4').getFig()
fig4.savefig(Path(P.plots) / 'AA4-C.png', dpi=500, bbox_inches='tight')