# 1: Brute Force

In this notebook, we explore the results generated by the `1_bruteforce.sh` script, which throws a large amount of compute at the problem to see how the different configurations behave.

The configurations that we explore are:
  - BL: baseline (no experience replay, no target network)
  - ER: only experience replay (replay buffer size 10k)
  - TN: only target network (update frequency 1k)
  - TR: both target network and experience replay (same hyperparameters as above)

The other settings are as follows:

| parameter                  | value
|----------------------------|------
| number of individual runs  | 6
| number of episodes         | 50k
| network architecture       | 2 hidden layers with 24 units each (ReLU activation)
| exploration strategy       | $\varepsilon$-greedy
| annealing scheme           | scheme 1: exponential, 1.0 $\to$ 0.01 over 80% of episodes
| batch size                 | 512
| learning rate ($\alpha$)   | 0.001 (Adam optimizer)
| discount factor ($\gamma$) | 0.999

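As a rough illustration of the annealing row in the table, the sketch below decays $\varepsilon$ exponentially from 1.0 to 0.01 over 80% of the 50k episodes. The function name `annealed_epsilon`, its parameters, and the choice to hold $\varepsilon$ constant once annealing ends are assumptions made for illustration; the actual "scheme 1" in the `dql` package may be implemented differently.

```python
def annealed_epsilon(episode, n_episodes=50_000, eps_start=1.0,
                     eps_end=0.01, anneal_frac=0.8):
    """Hypothetical exponential annealing: eps_start -> eps_end, then constant."""
    # number of episodes over which epsilon is annealed (80% of 50k = 40k)
    anneal_episodes = anneal_frac * n_episodes
    # fraction of the annealing window that has elapsed, capped at 1
    progress = min(episode / anneal_episodes, 1.0)
    # geometric interpolation between eps_start and eps_end
    return eps_start * (eps_end / eps_start) ** progress
```

Under these assumptions, $\varepsilon$ has dropped to about 0.1 halfway through the annealing window (episode 20k) and stays at 0.01 from episode 40k onward.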

## Preliminaries

In [None]:
import os
from pathlib import Path

from dql.utils.namespaces import P
from dql.utils.datamanager import ConcatDataManager
from dql.utils.plotter import ColorPlot, LossPlot, ComparisonPlot

import numpy as np
import matplotlib.pyplot as plt

Check that the data is present.

The output should cover the BL, ER, TN, and TR configurations.

In [None]:
expID = 'ABF'
runIDs = [f for f in os.listdir(P.data) if f.startswith(expID + '-')]
print('\n'.join(runIDs))
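
The visual check above could also be automated. The following is a minimal sketch, assuming every entry in the data directory is named `<expID>-<config>` (optionally followed by a suffix), as the cells in this notebook suggest. The helper `missing_configs` is hypothetical and not part of the `dql` package.

```python
def missing_configs(run_ids, exp_id='ABF', expected=('BL', 'ER', 'TN', 'TR')):
    """Return the expected configs for which no entry starts with '<exp_id>-<config>'."""
    return [
        config for config in expected
        if not any(r.startswith(f'{exp_id}-{config}') for r in run_ids)
    ]
```

Calling `missing_configs(runIDs)` right after the listing would return an empty list when all four configurations are present.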

Check that the parameters are correct.
We inspect the run with the `TR` config, since it uses both the target network and experience replay and therefore lists all the hyperparameters.

In [None]:
ConcatDataManager(f'{expID}-TR').printSummary()

## Plotting

Define a function to easily get all figures for a given run.

In [None]:
runNames = {'BL': 'Baseline', 'ER': 'Experience Replay', 'TN': 'Target Network', 'TR': 'Target Network + Experience Replay'}

def getFigs(runID: str) -> tuple[plt.Figure, plt.Figure, plt.Figure]:
    """Return the reward, action bias, and loss figures for the given run."""
    title = f'| {runNames[runID]}\nNaive Approach'
    DM = ConcatDataManager(f'{expID}-{runID}')

    R = DM.loadRewards()
    fR = ColorPlot(R, label='reward', title=title).getFig()

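    A = DM.loadActions()
    # action bias: fraction of action 0, rescaled so that 0 means both actions
    # are used equally and 1 means only one action is used (two-action case)
    AB = np.abs((A / np.sum(A, axis=2, keepdims=True))[:, :, 0] - .5) * 2
    fAB = ColorPlot(AB, label='action bias', title=title).getFig()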

    L = DM.loadLosses()
    fL = LossPlot(L, title=title).getFig()
    return fR, fAB, fL
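
To make the action bias expression in `getFigs` more concrete, here is a small worked example on made-up counts. It assumes that `DM.loadActions()` returns action counts with shape (runs, episodes, actions) and that there are exactly two actions; only the last axis being the action axis follows directly from the code above. The helper name `action_bias` is hypothetical.

```python
import numpy as np

def action_bias(A):
    """Map action counts to a bias in [0, 1] (two-action case)."""
    # fraction of steps in which action 0 was chosen
    p0 = (A / np.sum(A, axis=2, keepdims=True))[:, :, 0]
    # distance from a 50/50 split, rescaled to [0, 1]
    return np.abs(p0 - 0.5) * 2

# toy counts: 1 run, 3 episodes, 2 actions (made-up numbers)
A = np.array([[[50, 50], [75, 25], [0, 100]]])
bias = action_bias(A)
print(bias)
```

The three toy episodes yield biases of 0, 0.5, and 1: a perfectly balanced episode, a 75/25 split, and an episode in which only one action was taken.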

---
### Baseline

In [None]:
runID = 'BL'
rewardFig, actionBiasFig, lossFig = getFigs(runID)
rewardFig.savefig(Path(P.plots) / f'{expID}-{runID}-R.png', dpi=500, bbox_inches='tight')
actionBiasFig.savefig(Path(P.plots) / f'{expID}-{runID}-AB.png', dpi=500, bbox_inches='tight')
lossFig.savefig(Path(P.plots) / f'{expID}-{runID}-L.png', dpi=500, bbox_inches='tight')

---
### Experience Replay

In [None]:
runID = 'ER'
rewardFig, actionBiasFig, lossFig = getFigs(runID)

---
### Target Network

In [None]:
runID = 'TN'
rewardFig, actionBiasFig, lossFig = getFigs(runID)

---
### Target Network + Experience Replay

In [None]:
runID = 'TR'
rewardFig, actionBiasFig, lossFig = getFigs(runID)
actionBiasFig.savefig(Path(P.plots) / f'{expID}-{runID}-AB.png', dpi=500, bbox_inches='tight')

---
### Comparison

In [None]:
data = []
# redefine runIDs to get the correct order
runIDs = ['BL', 'ER', 'TN', 'TR']
for runID in runIDs:
    DM = ConcatDataManager(f'{expID}-{runID}')
    R = DM.loadRewards()
    A = DM.loadActions()
    AB = np.abs((A / np.sum(A, axis=2, keepdims=True))[:, :, 0] - .5) * 2
    data.append((R, AB))

In [None]:
fig = ComparisonPlot(data, runIDs, 'Naive Approach').getFig()
fig.savefig(Path(P.plots) / f'{expID}-C.png', dpi=500, bbox_inches='tight')