### Developing RL for team allocation.

We will compare the RL approach to current algorithms (Random, Basic, Basin), and conduct thorough benchmarking and
performance comparison.

Initially we need to work on a reduced problem size in order to stay within the 'tabular regime' and be able to use
dynamic programming solutions to the RL task (see [reinforcement_learning_plan](reinforcement_learning_plan.pdf)).

To begin development, we will work only on optimising the hard skills of small teams in small organisations. This will
allow us to ensure that the algorithms are working correctly and to do some initial benchmarking to understand how the
problem will scale.

The plan of action involves:
1. [Define and count all possible states](#num_1)
2. [Add constraints to reduce size of state space](#num_2)
3. Create reduced probability function (hard skills only)
4. Write function to map state vector to ABM state (such that probability function can be called).
5. Define action space and constraints (not all actions possible in any given state - concise way to encode constraints?).
6. Define reward function for MDP (sparse? discounted?)
7. Define termination criteria for MDP
8. Implement dynamic programming algorithm (choice between value and policy iteration).
9. Test run of DP with benchmarking
10. Write function to map state vector to organisation strategy call (for Random, Basic, Basin team allocation).
11. Test run of Random, Basic, Basin with benchmarking on reduced problem, for comparison with DP.
12. Run experiments to determine scaling characteristics with problem size.
(13. Implement neural network approach....)
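
As a rough illustration of step 8, the following is a minimal sketch of value iteration on a tabular MDP. It is not the project's implementation: the function name `value_iteration`, the layout of `P` (transition probabilities indexed as `P[action][state][next_state]`) and `R` (expected rewards indexed as `R[action][state]`), the discount factor and the two-state toy problem are all assumptions made for this example.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Tabular value iteration (illustrative sketch, hypothetical interface).

    P[a][s][t] is the probability of moving from state s to state t under
    action a; R[a][s] is the expected immediate reward for action a in s.
    """
    n_actions = len(P)
    n_states = len(P[0])
    V = [0.0] * n_states

    def q_value(s, a, values):
        # One-step lookahead: immediate reward plus discounted expected value.
        expected_next = sum(P[a][s][t] * values[t] for t in range(n_states))
        return R[a][s] + gamma * expected_next

    for _ in range(max_iter):
        # Bellman optimality backup applied to every state.
        V_new = [max(q_value(s, a, V) for a in range(n_actions))
                 for s in range(n_states)]
        delta = max(abs(new - old) for new, old in zip(V_new, V))
        V = V_new
        if delta < tol:
            break

    # Greedy policy with respect to the final value estimates.
    policy = [max(range(n_actions), key=lambda a: q_value(s, a, V))
              for s in range(n_states)]
    return V, policy


# Toy problem (hypothetical): state 1 is absorbing with zero reward.
# Action 0 stays in place for no reward; action 1 moves state 0 to state 1
# and pays a reward of 1.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [0.0, 1.0]],  # action 1
]
R = [
    [0.0, 0.0],  # action 0
    [1.0, 0.0],  # action 1
]

V, policy = value_iteration(P, R)
print("values:", V)
print("policy:", policy)
```

Policy iteration would reuse the same one-step lookahead but alternate full policy evaluation with greedy improvement. Which of the two is cheaper for the reduced problem is the kind of question the benchmarking in step 9 is meant to answer.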

In [None]:
from scipy.optimize import minimize, NonlinearConstraint, basinhopping

In [None]:
import sys, os
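# Build the path with os.path.join so it works on any OS and avoids the invalid '\s' escape.
MODEL_DIR = os.path.realpath(os.path.dirname(os.path.join('..', 'superscript_model')))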
sys.path.append(os.path.normpath(MODEL_DIR))

In [None]:
from superscript_model import model

In [None]:
STEPS = 25

abm = model.SuperScriptModel(worker_count=60,
                             new_projects_per_timestep=2,
                             worker_strategy='Stake',
                             organisation_strategy='Basic')

<a id='num_1'></a>
#### 1. Defining the state vector (and listing all states).

Here we define the state vector for the hard-skill-only problem.

Based on the table in [reinforcement_learning_plan](reinforcement_learning_plan.pdf), we have `2H*5 * 5 * WH*5 * 2*W*H`
possible states, where H is the number of hard skills and W is the number of workers.
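
One way to check an enumeration against a theoretical count is sketched below. The encoding used here is a hypothetical toy, not the state vector from the plan: it assumes each of the W workers holds one of 5 levels in each of the H hard skills, so the expected count is 5^(W*H). The names `enumerate_states`, `theoretical_count` and `SKILL_LEVELS` are introduced purely for illustration.

```python
from itertools import product

SKILL_LEVELS = 5  # assumption: 5 discrete levels per hard skill


def enumerate_states(worker_count, hard_skill_count, levels=SKILL_LEVELS):
    """Lazily yield every toy state as a flat tuple of W*H skill levels."""
    n_entries = worker_count * hard_skill_count
    return product(range(levels), repeat=n_entries)


def theoretical_count(worker_count, hard_skill_count, levels=SKILL_LEVELS):
    """Closed-form number of toy states: levels ** (W * H)."""
    return levels ** (worker_count * hard_skill_count)


W, H = 2, 3  # deliberately tiny so that full enumeration is feasible
enumerated = sum(1 for _ in enumerate_states(W, H))
expected = theoretical_count(W, H)

# Storage estimate if every state were stored at one byte per entry.
bytes_needed = expected * W * H

print(f"enumerated={enumerated}, expected={expected}, bytes={bytes_needed}")
```

Even in this toy encoding the count grows exponentially in W*H, which is why the enumeration is only practical for very small organisations.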

In [None]:
# code to create and populate state vector

In [None]:
# code to count number of states and size in memory/on disk

In [None]:
# code to compare size with theoretical calculation

<a id='num_2'></a>
#### 2. Adding constraints on state space size.

Not all states are valid.

We have constraints based on:
- team size (min 3, max 7).
- team budget
- project requirements
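
To get a feel for how much the team-size constraint alone prunes the space, the sketch below counts the worker subsets that satisfy the 3 to 7 member limit and compares that with all 2^W subsets. It considers team membership only; budget and project requirements are not modelled, and the helper names `count_teams` and `is_valid_team` are hypothetical.

```python
from math import comb

MIN_TEAM_SIZE = 3
MAX_TEAM_SIZE = 7


def count_teams(worker_count, min_size=MIN_TEAM_SIZE, max_size=MAX_TEAM_SIZE):
    """Number of distinct worker subsets whose size lies within the limits."""
    upper = min(max_size, worker_count)
    return sum(comb(worker_count, k) for k in range(min_size, upper + 1))


def is_valid_team(team, min_size=MIN_TEAM_SIZE, max_size=MAX_TEAM_SIZE):
    """Check the size constraint for a single candidate team."""
    return min_size <= len(team) <= max_size


for w in (5, 10, 20):
    unconstrained = 2 ** w  # every subset of workers, including the empty one
    valid = count_teams(w)
    print(f"W={w}: {valid} valid teams out of {unconstrained} subsets "
          f"({valid / unconstrained:.1%})")
```

The share of valid subsets shrinks as W grows, because most subsets of a large workforce contain more than 7 members.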

In [None]:
# code to define constraints and produce reduced state vector

In [None]:
# code to count number of valid states and size in memory/on disk. Quantify reduction.
