# Training

## Create a Description of the Problem

In [1]:
from problem_description_and_state import ProblemDescription

pd = ProblemDescription(inventory_cols = 2, inventory_rows = 2)
# other aspects (allowed verbs like store and restore, allowed item types / colors)
# can be influenced by optional parameters

## Model

### States

A state fully describes an inventory with its stored items (where they are stored, what color they have). It also describes the request (expressed as verb and color, e.g. _store red_).

The invalid state is a special case. It has no inventory and no request and it can't be left once entered. Encountering an invalid (i.e. impossible) request or choosing an invalid action (see below) leads to that state.

### Actions

An action describes which inventory slot is used to fulfill the request (as given by the state).

There is one action for each inventory slot. The action n is thus:
_Given current inventory state and given the request (verb + color) I choose inventory slot n to fulfill the request._

If the request cannot be fulfilled (the inventory is full and a store request is given, or a restore request is given for a color that does not exist in the inventory), it is called an _invalid request_. Such a request should be ignored (which is a special action), and the system then enters the _invalid state_. (Availability of that action can be switched off in the `ProblemDescription` constructor.)

An _invalid action_ is an action that is clearly a logical error, even for a greedy algorithm, e.g. storing an item in an occupied slot while other slots are free, or restoring an item of the wrong color while one of the proper color exists.
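
The distinction between an invalid request and an invalid action can be sketched in a few lines. This is only an illustration, not the code in `problem_description_and_state.py`: the function names and the `STORE`/`RESTORE`/`EMPTY` constants are made up here, and the slot encoding (0 means empty, otherwise 1 + color) is the one listed in the Implementation notes of this document.

```python
# Hypothetical constants, following the encoding described in this document.
EMPTY = 0              # slot content: 0 means empty, otherwise 1 + color
STORE, RESTORE = 0, 1  # verb indices

def is_invalid_request(inventory, verb, color):
    """True if no slot at all could fulfill the request."""
    if verb == STORE:
        return EMPTY not in inventory        # inventory is full
    return (1 + color) not in inventory      # requested color is not stored

def is_invalid_action(inventory, verb, color, slot):
    """True if the chosen slot is an avoidable mistake for a valid request."""
    if is_invalid_request(inventory, verb, color):
        return False  # assumption: here the request is at fault, not the action
    if verb == STORE:
        return inventory[slot] != EMPTY      # occupied slot although a free one exists
    return inventory[slot] != 1 + color      # wrong content although the color exists
```

For example, with the inventory `[1, 0, 0, 0]` a request to store color 0 is valid, choosing slot 0 is an invalid action, and choosing slot 1 is fine.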

### Transition Probability Matrices

#### Structure
Any numbers given are for a 3x2 inventory.

* dimension 1 = actions (equal for all states)
  * actions: choose inventory slot 1 / 2 / .. / 6 or ignore invalid request
  * invalid request doesn't happen in training and test data -> could ignore it for now and append it after training
* dimension 2 = from_state
* dimension 3 = to_state
  * state: defined by inventory and request -> 4^6 \* 2\*3 states (plus the single invalid state)
  * 4: describes state of a single inventory slot (empty, or one of the three colors)
  * 6: number of inventory slots (their order matters)
  * 2: possible verbs (store / restore)
  * 3: possible colors
* value = probability
* the inventory of the to_state is deterministic (determined by executing the verb-color request on the inventory slot specified through the action on the inventory of the from_state)
* verb and color of the to_state are stochastic, i.e. their probabilities have to be learned from the training data (see below)
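
As a quick check of the state count, the arithmetic can be written out. The helper below is a hypothetical sketch (the real value is exposed as `number_of_states` on the problem description); it assumes three colors, two verbs, and one extra invalid state, which agrees with the 1537 states reported for the 2x2 inventory in the debug output of this notebook.

```python
def number_of_states(cols, rows, n_colors=3, n_verbs=2):
    """Count inventory/request combinations plus the single invalid state."""
    n_slots = cols * rows
    slot_contents = 1 + n_colors          # empty, or one of the colors
    valid_states = slot_contents ** n_slots * n_verbs * n_colors
    return valid_states + 1               # + 1 for the invalid state

print(number_of_states(2, 2))  # 4^4 * 2*3 + 1
print(number_of_states(3, 2))  # 4^6 * 2*3 + 1
```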

#### Implementation
* variables like inventory slot, slot content, verb, color are expressed as an integer and are zero-based
  * verb: 0 means store, 1 means restore; zero-based index into `problem_description.verbs`
  * color: zero-based index into `problem_description.colors`
  * inventory slot index: zero-based, counting from left to right and from top to bottom in a row-like fashion
  * inventory slot content: 0 means empty, 1+ means filled (value = 1 + color)
  * action: equal to index of chosen inventory slot; special case: ignore request action is equal to the number of inventory slots
* dimension 1 (actions) as a list (or iterable)
* dimensions 2 and 3 (from state A to state B) as a scipy.sparse.csr_matrix
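
With this integer encoding, one matrix row combines a deterministic inventory update with the stochastic next request. The sketch below illustrates that idea only; `apply_request` and `successor_row` are hypothetical names, states are shown as tuples rather than state indices, and the next-request probabilities are simply passed in as a dict.

```python
def apply_request(inventory, verb, color, slot):
    """Return the successor inventory, or None if the action cannot be executed."""
    new_inventory = list(inventory)
    if verb == 0 and inventory[slot] == 0:            # store into an empty slot
        new_inventory[slot] = 1 + color
    elif verb == 1 and inventory[slot] == 1 + color:  # restore the matching item
        new_inventory[slot] = 0
    else:
        return None
    return new_inventory

def successor_row(inventory, verb, color, slot, request_probabilities):
    """List (successor state, probability) pairs for one from_state and action.

    request_probabilities maps the (verb, color) of the next request
    to its probability.
    """
    new_inventory = apply_request(inventory, verb, color, slot)
    if new_inventory is None:
        return [("invalid", 1.0)]   # the invalid state is the only successor
    return [((tuple(new_inventory), v, c), p)
            for (v, c), p in sorted(request_probabilities.items())]
```

Rows of this kind, once converted to state indices, are what a sparse matrix per action would be assembled from.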

#### Converting State to State Index and Back

See implementation at `problem_description_and_state.py` class `State`:
* member function `get_index()`
* factory method `from_index()`
* state enumeration: index = ...
  * requested_color +
  * #colors \* requested_verb +
  * #colors \* #verbs \* (1 + #colors)^0 \* inventory_slot_1 +
  * #colors \* #verbs \* (1 + #colors)^1 \* inventory_slot_2 +
  * #colors \* #verbs \* (1 + #colors)^2 \* inventory_slot_3 +
  * #colors \* #verbs \* (1 + #colors)^3 \* inventory_slot_4 +
  * #colors \* #verbs \* (1 + #colors)^4 \* inventory_slot_5 +
  * #colors \* #verbs \* (1 + #colors)^5 \* inventory_slot_6
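
The enumeration above is a mixed-radix number: the request occupies the low "digits" and each inventory slot is one base-(1 + #colors) digit. A standalone sketch of both directions follows; the function names and the `n_colors`/`n_verbs` parameters are hypothetical and not the API of the `State` class, and the invalid state is left out.

```python
def state_to_index(inventory, verb, color, n_colors=3, n_verbs=2):
    """Encode inventory and request as a single integer."""
    base = 1 + n_colors
    inventory_code = sum(content * base ** slot
                         for slot, content in enumerate(inventory))
    return color + n_colors * verb + n_colors * n_verbs * inventory_code

def index_to_state(index, n_slots, n_colors=3, n_verbs=2):
    """Decode an integer back into (inventory, verb, color)."""
    base = 1 + n_colors
    index, color = divmod(index, n_colors)
    inventory_code, verb = divmod(index, n_verbs)
    inventory = []
    for _ in range(n_slots):
        inventory_code, content = divmod(inventory_code, base)
        inventory.append(content)
    return inventory, verb, color
```

Decoding peels the digits off with `divmod` in the same order in which the formula multiplies them in, so the two functions are inverses of each other for every valid state.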

Test case:

In [2]:
from problem_description_and_state import ProblemDescription, State

pd_3x2 = ProblemDescription(3, 2)

s1 = State([1,2,0,3,0,0], 0, 1)
s1_index = 1 + 3*0 + 3*2*(1 + 4*2 + (4**3)*3)
assert s1.get_index(pd_3x2) == s1_index
assert s1 == State.from_index(pd_3x2, s1_index)

for s_index in range(pd_3x2.number_of_states):
    s = State.from_index(pd_3x2, s_index)
    if s_index == pd_3x2.invalid_state_index:
        s2 = pd_3x2.invalid_state
    else:
        s2 = State(s.inventory, s.verb, s.color)
    assert s2.get_index(pd_3x2) == s_index

#### Read Training Data to Determine State Transition Probabilities

In [3]:
from data_exploration_classes import Data
from problem_description_and_state import ProblemDescription, State, ColorCountState, InventoryLevelState
from training_classes import TransitionProbabilityLearner
import collections
import statistics

data = Data(pd, 'data/2x2/warehousetraining.txt')
print('{} requests read from file {}'.format(len(data.requests), data.filepath))

learner = TransitionProbabilityLearner(pd, data, min_support=5)
print(('Accounting for inventory item order there are {} unique valid state transitions\n'
       '(out of {} according to matrix shape).').format(
    sum((1 for _ in pd.get_valid_state_transitions())),
    pd.number_of_states ** 2
))
print('Ignoring inventory item order a total of {} out of {} unique valid state transitions were observed.'.format(
    learner.color_count_matrix.get_number_of_entries(),
    sum((1 for _ in pd.get_valid_color_count_state_transitions()))
))
transitions = collections.defaultdict(list)
supports = []
for (from_si, to_si) in pd.get_valid_color_count_state_transitions():
    transitions[from_si].append(to_si)
for from_si in sorted(transitions):
    support = sum((learner.color_count_matrix.get(from_si, to_si) for to_si in transitions[from_si]))
    supports.append(support)
    # if support < learner.min_support * len(transitions[from_si]):
    #     print('insufficient support ({: >2} < {}*{}) for {}'.format(
    #         support, learner.min_support, len(transitions[from_si]), ColorCountState.from_index(pd, from_si)
    #     ))
print(('\tsupport statistics (number of observed transitions from a state):\n'
       '\t\tmin {}, max {}, mean {:.0f}, median {:.0f}, stddev {:.1f}').format(
    min(supports), max(supports), statistics.mean(supports), statistics.median(supports), statistics.stdev(supports)
))
print(('Additionally ignoring color {} out of {} unique valid state transitions were observed.').format(
    learner.inventory_level_matrix.get_number_of_entries(),
    sum((1 for _ in pd.get_valid_inventory_level_state_transitions()))
))

8177 requests read from file data/2x2/warehousetraining.txt
Accounting for inventory item order there are 6624 unique valid state transitions
(out of 2362369 according to matrix shape).
Ignoring inventory item order a total of 443 out of 480 unique valid state transitions were observed.
	support statistics (number of observed transitions from a state):
		min 2, max 394, mean 68, median 42, stddev 70.5
Additionally ignoring color 14 out of 14 unique valid state transitions were observed.


This illustrates that the granularity at which transitions are observed matters. At a fine granularity there might not be enough data for rare transitions. Observing transitions at several levels of granularity makes it possible to fall back to a coarser level when needed.

The `TransitionProbabilityLearner` class uses 3 levels:
1. state defined as number of items of each color, and request (verb + color)
  * implemented as `ColorCountState` in `problem_description_and_state.py`
1. state defined as number of items and verb (ignoring color)
  * implemented as `InventoryLevelState` in `problem_description_and_state.py`
1. no concept of state, only counting color frequencies

The decision when to fall back to a coarser level is controlled by the `min_support` parameter (default 5), which specifies how many samples per successor state must have been observed for the current state (from_state) to avoid a fallback.
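
The fallback rule can be illustrated with a small standalone function. This is a simplified sketch and not the logic of `TransitionProbabilityLearner`: the function name is hypothetical, and it assumes the counts of every level have already been projected onto the same set of possible successors. The support test (observed transitions compared with `min_support` times the number of successors) mirrors the commented-out check in the cell above.

```python
def request_probabilities(levels, min_support=5):
    """Normalize the counts of the finest level that has enough support.

    levels: list of dicts ordered from finest to coarsest; each maps every
    possible successor to its observed count for the current state.
    """
    for counts in levels[:-1]:
        support = sum(counts.values())
        if counts and support >= min_support * len(counts):
            return {succ: n / support for succ, n in counts.items()}
    # The coarsest level is used regardless of its support.
    counts = levels[-1]
    total = sum(counts.values())
    if total == 0:
        return {succ: 1 / len(counts) for succ in counts}  # nothing observed: uniform
    return {succ: n / total for succ, n in counts.items()}

fine = {"store red": 2, "store blue": 1}      # 3 observations < 5 * 2
coarse = {"store red": 30, "store blue": 10}  # 40 observations >= 5 * 2
print(request_probabilities([fine, coarse]))
```

In this example the fine level is skipped because three observations do not reach the required ten, so the probabilities come from the coarse counts.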

#### Defining the Row Entries of the Transition Probability Matrix and Constructing a Sparse Matrix from the Row Entries

See implementation at `training_classes.py` class `TransitionProbabilityMatrix`:
* initialization via `tpms = TransitionProbabilityMatrix(problem_description, training_data_statistics)`
* access to an individual SxS transition probability matrix for action A is done via indexing an instance of the class
  * e.g. for action 0: `tpms[0]`
* the class supports `len(instance)`, which returns the number of actions for which there are matrices and completes the array-like behavior
* accessing an index lazily creates a `scipy.sparse.csr_matrix` via member function `get_tpm()` and the resulting matrix is cached for later access
* `get_tpm()` uses helper function `get_tpm_row()` to get the entries for a certain row
  * the returned entries are the indices of the successor states for the state given by the row index and their probabilities
* the probabilities are given by the `TransitionProbabilityLearner` class mentioned above
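
The array-like behavior with lazy creation and caching can be sketched independently of the matrix contents. The class below is a hypothetical stand-in, not `TransitionProbabilityMatrix` itself: it takes a `build` callable where the real class would construct a `scipy.sparse.csr_matrix` from its row entries.

```python
class LazyMatrixList:
    """List-like container that builds one item per action on first access."""

    def __init__(self, number_of_actions, build):
        self._number_of_actions = number_of_actions
        self._build = build    # callable: action index -> matrix
        self._cache = {}

    def __len__(self):
        return self._number_of_actions

    def __getitem__(self, action):
        if not 0 <= action < self._number_of_actions:
            raise IndexError(action)   # also ends iteration via the sequence protocol
        if action not in self._cache:
            self._cache[action] = self._build(action)
        return self._cache[action]

built = []
matrices = LazyMatrixList(3, lambda action: built.append(action) or 'matrix {}'.format(action))
matrices[1]
matrices[1]        # served from the cache, not rebuilt
print(built)       # actions built so far
```

Because `__getitem__` raises `IndexError` past the last action, such an object can also be passed to `enumerate()` or `list()` without defining `__iter__`.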

Test cases for row entries:

In [4]:
from problem_description_and_state import ProblemDescription, State
from data_exploration_classes import Data
from training_classes import TransitionProbabilityLearner, TransitionProbabilityMatrix

pd_3x2 = ProblemDescription(3, 2)
data_3x2 = Data(pd_3x2, 'data/3x2/warehousetraining.txt')
learner_3x2 = TransitionProbabilityLearner(pd_3x2, data_3x2, min_support=5)
tpm_3x2 = TransitionProbabilityMatrix(pd_3x2, learner_3x2)

# check if there are always between 1 and 6 successor states for any state
# 1 is the minimum because there is at least the invalid state as a successor state
# 6 is the maximum because there can only be at most one successor state inventory
#   (deterministic) and up to (#verbs * #colors)=6 possible requests
for action in range(0, pd_3x2.number_of_actions):
    for state_index in range(0, pd_3x2.number_of_states):
        assert 1 <= len(tpm_3x2.get_tpm_row(action, state_index)) <= 6

# check if successor states for state 0 (empty inventory, store color1)
# and action 0 (store in first inventory slot)
# are 6-9 (6: inventory has color1 in slot1, next request is store color1)
# (7-9: other next requests [store other color or restore color1])
action = 0
verb_store = 0
color1 = 0
s0 = State([0,0,0,0,0,0], verb_store, color1)
s0_index = s0.get_index(pd_3x2)
assert s0_index == 0
row = tpm_3x2.get_tpm_row(action, s0_index)
assert [cell[0] for cell in row] == [6, 7, 8, 9]

# check if trying to store color1 into an inventory where slot1 already contains color1 leads to an invalid state
s6 = State([1,0,0,0,0,0], verb_store, color1)
s6_index = s6.get_index(pd_3x2)
assert s6_index == 6
row = tpm_3x2.get_tpm_row(action, s6_index)
assert [cell[0] for cell in row] == [pd_3x2.invalid_state_index]

# check if trying to restore color1 from an empty inventory leads to an invalid state
verb_restore = 1
s3 = State([0,0,0,0,0,0], verb_restore, color1)
s3_index = s3.get_index(pd_3x2)
assert s3_index == 3
row = tpm_3x2.get_tpm_row(action, s3_index)
assert [cell[0] for cell in row] == [pd_3x2.invalid_state_index]

# check if trying to restore color2 from an inventory slot with color1 leads to an invalid state
color2 = 1
s10 = State([1,0,0,0,0,0], verb_restore, color2)
s10_index = s10.get_index(pd_3x2)
assert s10_index == 10
row = tpm_3x2.get_tpm_row(action, s10_index)
assert [cell[0] for cell in row] == [pd_3x2.invalid_state_index]

Instantiation and some debug output to illustrate the sparse matrices:

In [5]:
tpms = TransitionProbabilityMatrix(pd, learner)

print('according to problem description there are {} states for a {}x{} inventory (including the invalid state)'.format(
    pd.number_of_states, pd.number_of_inventory_cols, pd.number_of_inventory_rows
))
print('a dense transition probability matrix would have {}x{} = {} entries'.format(
    pd.number_of_states, pd.number_of_states, pd.number_of_states ** 2
))
print('using sparse matrices yields the following results:')
for action, tpm in enumerate(tpms):
    print('action {}: transition probability matrix has {} explicit entries'.format(
        action, tpm.nnz
    ))

according to problem description there are 1537 states for a 2x2 inventory (including the invalid state)
a dense transition probability matrix would have 1537x1537 = 2362369 entries
using sparse matrices yields the following results:
action 0: transition probability matrix has 2809 explicit entries
action 1: transition probability matrix has 2809 explicit entries
action 2: transition probability matrix has 2809 explicit entries
action 3: transition probability matrix has 2809 explicit entries
action 4: transition probability matrix has 1537 explicit entries


Note: the last matrix is for the _ignore request_ action, where each state leads only to the invalid state, hence exactly one entry per state.

### Reward Matrices

#### Structure

* dimension 1: from_state
* dimension 2: actions
* value = reward

#### Implementation

* as a numpy array of type int

#### Defining the Reward for each State and Action

* should be (-1) \* the Manhattan distance from the last inventory slot to the slot chosen by the action
  * the `ProblemDescription` class has a method to calculate that distance, as it is specific to the inventory shape
* should be lower (a heavier penalty) for an invalid request, an invalid action, or the invalid state, to deter the agent from entering the invalid state or from choosing invalid actions

See implementation at `problem_description_and_state.py` class `State`:
* member function `get_reward(problem_description, action)` (where action is a number equal to the slot number starting with 0)
  * uses helper function `is_invalid_state` to identify the invalid state
  * uses helper function `is_invalid_action` to identify whether the chosen action would lead to an (avoidable) error (e.g. storing an item in an occupied slot if there are other slots which are free, or restoring an item of the wrong color)
  * uses `problem_description.get_manhattan_distance_to_last_inventory_slot(action)` to get the distance
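
The reward rule can be sketched as follows. Both functions are hypothetical illustrations rather than the project's implementation: slots are assumed to be numbered row by row as described earlier, and the penalty of one more than the number of slots is an assumed value, chosen only because it is lower than any regular reward.

```python
def manhattan_distance_to_last_slot(slot, cols, rows):
    """Distance from a zero-based slot index to the last slot of the inventory."""
    row, col = divmod(slot, cols)          # slots are numbered row by row
    return (rows - 1 - row) + (cols - 1 - col)

def reward(slot, cols, rows, is_invalid=False):
    """Negative distance for a proper action, a heavier penalty otherwise."""
    if is_invalid:
        return -(cols * rows) - 1          # assumed penalty, below every valid reward
    return -manhattan_distance_to_last_slot(slot, cols, rows)

# For an inventory with 3 columns and 2 rows:
print(reward(5, 3, 2))                   # last slot, distance 0
print(reward(0, 3, 2))                   # first slot, distance 3
print(reward(0, 3, 2, is_invalid=True))  # penalized case
```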
  
Test cases for rewards:

In [6]:
pd_3x2 = ProblemDescription(3, 2)
s0 = State.from_index(pd_3x2, 0)

# storing color1 item in last inventory slot (with distance == 0) should give max. reward of 0
assert s0.get_reward(pd_3x2, action=5) == 0
# storing color1 item in first inventory slot (with distance == 3) should give min. typical reward of -3
assert s0.get_reward(pd_3x2, action=0) == -3
# all actions in the invalid state should give a heavy penalty (to deter from even entering that state)
for action in range(0, pd_3x2.number_of_actions):
    # compare against the slot count of the 3x2 problem under test, not the 2x2 `pd`
    assert pd_3x2.invalid_state.get_reward(pd_3x2, action) <= -pd_3x2.number_of_inventory_slots

s6 = State.from_index(pd_3x2, 6)
# s6: inventory == [1,0,0,0,0,0], verb == store, item_type == color1
# storing color1 item in the first inventory slot (again) should be an invalid action
# and be penalized heavily (to deter from choosing such actions)
assert s6.get_reward(pd_3x2, action=0) < -pd_3x2.number_of_inventory_slots

#### Constructing the Reward Matrix

See implementation at `training_classes.py` classes `RewardMatrixS1` (similar to `TransitionProbabilityMatrix`) or `RewardMatrixSA` (of shape SxA).

It turned out that -- contrary to the documentation -- using `RewardMatrixS1` as a list of matrices (one for each action) is incompatible with the MDP solver library and one reward matrix of shape SxA (as in `RewardMatrixSA`) must be used instead.

In [7]:
from training_classes import RewardMatrixSA

reward_matrix = RewardMatrixSA(pd).get()

print('reward matrix shape: {}'.format(reward_matrix.shape))

reward matrix shape: (1537, 5)


## Running the MDP Solver

In [8]:
import mdptoolbox

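solver = mdptoolbox.mdp.PolicyIteration(
    list(tpms),       # one SxS transition probability matrix per action
    reward_matrix,    # rewards of shape SxA
    0.99,             # discount factor
    max_iter=10,      # upper bound on iterations; a run that reaches it may not have converged
)
solver.run()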

print('MDP solved after {} iterations, using {} cpu seconds'.format(solver.iter, solver.time))
print('produced policy is a tuple of length {}'.format(len(solver.policy)))
print('optimal policy:')
solver.policy

MDP solved after 10 iterations, using 2.496164083480835 cpu seconds
produced policy is a tuple of length 1537
optimal policy:


(3,
 3,
 3,
 4,
 4,
 4,
 3,
 3,
 3,
 0,
 4,
 4,
 3,
 3,
 3,
 4,
 0,
 4,
 3,
 3,
 3,
 4,
 4,
 0,
 3,
 3,
 3,
 1,
 4,
 4,
 3,
 3,
 3,
 1,
 4,
 4,
 3,
 3,
 3,
 1,
 0,
 4,
 3,
 3,
 3,
 1,
 4,
 0,
 3,
 3,
 3,
 4,
 1,
 4,
 3,
 3,
 3,
 0,
 1,
 4,
 3,
 3,
 3,
 4,
 1,
 4,
 3,
 3,
 3,
 4,
 1,
 0,
 3,
 3,
 3,
 4,
 4,
 1,
 3,
 3,
 3,
 0,
 4,
 1,
 3,
 3,
 3,
 4,
 0,
 1,
 3,
 3,
 3,
 4,
 4,
 1,
 3,
 3,
 3,
 2,
 4,
 4,
 3,
 3,
 3,
 2,
 4,
 4,
 3,
 3,
 3,
 2,
 0,
 4,
 3,
 3,
 3,
 2,
 4,
 0,
 3,
 3,
 3,
 1,
 4,
 4,
 3,
 3,
 3,
 1,
 4,
 4,
 3,
 3,
 3,
 2,
 0,
 4,
 3,
 3,
 3,
 1,
 4,
 0,
 3,
 3,
 3,
 2,
 1,
 4,
 3,
 3,
 3,
 2,
 1,
 4,
 3,
 3,
 3,
 2,
 1,
 4,
 3,
 3,
 3,
 2,
 1,
 0,
 3,
 3,
 3,
 2,
 4,
 1,
 3,
 3,
 3,
 2,
 4,
 1,
 3,
 3,
 3,
 2,
 0,
 1,
 3,
 3,
 3,
 2,
 4,
 1,
 3,
 3,
 3,
 4,
 2,
 4,
 3,
 3,
 3,
 0,
 2,
 4,
 3,
 3,
 3,
 4,
 2,
 4,
 3,
 3,
 3,
 4,
 2,
 0,
 3,
 3,
 3,
 1,
 2,
 4,
 3,
 3,
 3,
 1,
 2,
 4,
 3,
 3,
 3,
 1,
 2,
 4,
 3,
 3,
 3,
 1,
 2,
 0,
 3,
 3,
 3,
 4,
 1,
 4,
 3,
 3,
 3,
 0,
