# MDP Lesson 2: additional methods

## Import the modules

In [1]:
import marmote.core as mc
import marmote.markovchain as mmc
import marmote.mdp as md

We also import NumPy.

In [2]:
import numpy as np

## Table of Contents

1. [Model](#model)
2. [Build the MDP - Constructor 1](#build-the-mdp---constructor-1)
3. [Build the MDP - Constructor 2](#build-the-mdp---constructor-2)
4. [Build the MDP - Constructor 3](#build-the-mdp---constructor-3)
5. [Q-Value](#q-value)

## Model

[Back to Table of Contents](#table-of-contents)

### Description of the Model
We use a model with three states *s1, s2, s3* and three actions *a0, a1, a2*. Transition probabilities and rewards are described by the picture below.

<img src="./geron.png">

## Build the MDP - Constructor 1

[Back to Table of Contents](#table-of-contents)

### Creating the State and Action Spaces

In [3]:
actionSpace = mc.MarmoteInterval(0, 2)
stateSpace = mc.MarmoteInterval(0, 2)

### Some modelling choices

As the picture shows, the set of available actions differs from state to state. In state *s1* any of the actions *a0*, *a1*, and *a2* can be triggered, while in state *s3* only action *a2* is available.

To make programming easier, we use the same action space in every state, so that all actions *a0*, *a1*, *a2* can be selected in all states. To do this, we add the missing actions as dummy actions that have no effect (the state does not change) and that receive a large penalty.

Hence in state *s3* we add action *a0* with a transition to *s3* with probability *1*.

Likewise, in state *s3* we add action *a1* with a transition to *s3* with probability *1*.

Finally, in state *s2* we add action *a2* with a transition to *s2* with probability *1*.

Here we enter the transition matrices (zero entries are not entered) and append each matrix to a list. The matrices follow the modelling choice presented above and therefore all have the same dimension.

## Build the MDP - Constructor 2

[Back to Table of Contents](#table-of-contents)

### Creating Transition Matrices

In [4]:
trans = list()

# matrix for the a_0 action
P0 = mc.SparseMatrix(3)
P0.setEntry(0, 0, 0.7)
P0.setEntry(0, 1, 0.3)
P0.setEntry(1, 1, 1.0)
P0.setEntry(2, 2, 1.0)
trans.append(P0)

# matrix for the a_1 action
P1 = mc.SparseMatrix(3)
P1.setEntry(0, 0, 1.0)
P1.setEntry(1, 2, 1.0)
P1.setEntry(2, 2, 1.0)
trans.append(P1)

# matrix for the a_2 action
P2 = mc.SparseMatrix(3)
P2.setEntry(0, 0, 0.8)
P2.setEntry(0, 1, 0.2)
P2.setEntry(1, 1, 1.0)
P2.setEntry(2, 0, 0.8)
P2.setEntry(2, 1, 0.1)
P2.setEntry(2, 2, 0.1)
trans.append(P2)
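
As a sanity check on the modelling choice above, the sketch below verifies that every padded matrix is row-stochastic, so each action is defined in each state. It uses plain Python lists that mirror the entries of `P0`, `P1`, `P2`; the helper `is_row_stochastic` is a hypothetical name introduced for illustration and is not part of Marmote.

```python
# Dense copies of the three transition matrices entered above
# (rows: current state, columns: next state). List index k is action a_k.
P = [
    [[0.7, 0.3, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # a_0
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # a_1
    [[0.8, 0.2, 0.0], [0.0, 1.0, 0.0], [0.8, 0.1, 0.1]],  # a_2
]


def is_row_stochastic(matrix, tol=1e-12):
    """Return True if all entries are non-negative and every row sums to 1."""
    return all(
        min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


# One boolean per action: True when the matrix is a valid transition matrix.
checks = [is_row_stochastic(Pa) for Pa in P]
print(checks)
```

A row that does not sum to 1 would indicate a state in which the corresponding action was left undefined.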

## Build the MDP - Constructor 3

[Back to Table of Contents](#table-of-contents)

### Creating Several Reward Matrices

Since the rewards depend on the transition (there is a gain or cost per transition), we use a second way to build the MDP.
This second constructor takes two lists of matrices: one for the transition matrices, as before, and one for the rewards per transition. In the latter list, the matrix at the *k*-th entry defines the gains associated with the action with index *k*. In this matrix, the entry with coordinates (i,j) defines the reward of the transition from i to j.

We also define the penalty given to unavailable actions. Here a large negative value (-10^5) is used.

In [5]:
penalty = -100000

R1 = mc.SparseMatrix(3)
R2 = mc.SparseMatrix(3)
R3 = mc.SparseMatrix(3)

# fill in the nonzero entries of the sparse matrices
R1.setEntry(0, 0, 10)
R1.setEntry(2, 2, penalty)

R2.setEntry(1, 2, -50)
R2.setEntry(2, 2, penalty)

R3.setEntry(1, 1, penalty)
R3.setEntry(2, 0, 40)

# Adding reward to list
rews = list()
rews.append(R1)
rews.append(R2)
rews.append(R3)

Let us check the matrices

In [6]:
print("Checking")
print("R1", str(R1))
print("R2", str(R2))
print("R3", str(R3))

Checking
R1 [[1.000000e+01, 0.000000e+00, 0.000000e+00],
 [0.000000e+00, 0.000000e+00, 0.000000e+00],
 [0.000000e+00, 0.000000e+00, -1.000000e+05]]

R2 [[0.000000e+00, 0.000000e+00, 0.000000e+00],
 [0.000000e+00, 0.000000e+00, -5.000000e+01],
 [0.000000e+00, 0.000000e+00, -1.000000e+05]]

R3 [[0.000000e+00, 0.000000e+00, 0.000000e+00],
 [0.000000e+00, -1.000000e+05, 0.000000e+00],
 [4.000000e+01, 0.000000e+00, 0.000000e+00]]
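
To see how the per-transition rewards combine with the transition probabilities, the sketch below computes the expected one-step reward of each state-action pair, that is the sum over j of P_a(i,j) * R_a(i,j). It is written in plain Python with dense copies of the matrices above; the function `expected_reward` is a hypothetical helper, not a Marmote call.

```python
penalty = -100000

# Dense copies of the transition matrices (list index k is action a_k).
P = [
    [[0.7, 0.3, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    [[0.8, 0.2, 0.0], [0.0, 1.0, 0.0], [0.8, 0.1, 0.1]],
]

# Dense copies of the reward matrices R1, R2, R3 (same indexing).
R = [
    [[10, 0, 0], [0, 0, 0], [0, 0, penalty]],
    [[0, 0, 0], [0, 0, -50], [0, 0, penalty]],
    [[0, 0, 0], [0, penalty, 0], [40, 0, 0]],
]


def expected_reward(P, R, s, a):
    """Expected immediate reward when action a is taken in state s."""
    return sum(p * r for p, r in zip(P[a][s], R[a][s]))


# r[s][a]: one row per state, one column per action.
r = [[expected_reward(P, R, s, a) for a in range(3)] for s in range(3)]
for s, row in enumerate(r):
    print("state", s, row)
```

The padded (unavailable) actions show up as entries equal to the penalty, which is what keeps a maximising policy away from them.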



### Parameters definition

In [7]:
beta = 0.95
criterion = "max"

### Build the MDP

In [8]:
second_mdp = md.DiscountedMDP(criterion, stateSpace, actionSpace, trans, rews, beta)
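
To illustrate what a discounted MDP with a "max" criterion represents, here is a minimal value-iteration sketch in plain Python on dense copies of the same data. It is the textbook algorithm, not Marmote's implementation; `value_iteration`, `tol` and `max_iter` are hypothetical names chosen for this example.

```python
penalty = -100000
beta = 0.95

# Dense copies of the transition and reward matrices (list index k is action a_k).
P = [
    [[0.7, 0.3, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    [[0.8, 0.2, 0.0], [0.0, 1.0, 0.0], [0.8, 0.1, 0.1]],
]
R = [
    [[10, 0, 0], [0, 0, 0], [0, 0, penalty]],
    [[0, 0, 0], [0, 0, -50], [0, 0, penalty]],
    [[0, 0, 0], [0, penalty, 0], [40, 0, 0]],
]


def bellman_q(P, R, beta, V):
    """Q[s][a] = sum_j P_a(s,j) * (R_a(s,j) + beta * V[j])."""
    n_actions, n_states = len(P), len(V)
    return [
        [
            sum(P[a][s][j] * (R[a][s][j] + beta * V[j]) for j in range(n_states))
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]


def value_iteration(P, R, beta, tol=1e-9, max_iter=10_000):
    """Iterate V <- max_a Q(., a) until successive values differ by less than tol."""
    V = [0.0] * len(P[0])
    for _ in range(max_iter):
        V_new = [max(q_row) for q_row in bellman_q(P, R, beta, V)]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    Q = bellman_q(P, R, beta, V)
    # Greedy policy: in each state, the action index with the largest Q-value.
    policy = [max(range(len(P)), key=lambda a: Q[s][a]) for s in range(len(V))]
    return V, policy, Q


V, policy, Q = value_iteration(P, R, beta)
print("value :", V)
print("policy:", policy)
```

Because the penalty is so large in absolute value, the greedy policy never selects a padded action, which is the purpose of the modelling choice made earlier.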

## Q-Value

[Back to Table of Contents](#table-of-contents)

### Creation of the Q Value associated with a policy

It is also possible to create a `FeedbackQvalueMDP` from a `DiscountedMDP`. A `FeedbackQvalueMDP` is an object that is created from the value of a policy (in our case a `FeedbackSolutionMDP`). It stores a *Q-value* for every pair *(s,a)*, where *s* is the state and *a* the action. From these values, it is then possible to draw actions at random according to the *EpsilonGreedy* or *Softmax* rules.
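
For reference, the Q-value of a pair *(s,a)* under a policy with value function *V* is usually defined as follows. This is the textbook definition written with the notation of this lesson (*P_a* and *R_a* are the transition and reward matrices of action *a*, and *beta* is the discount factor); that Marmote uses exactly this convention is an assumption, since the lesson does not show it.

```latex
Q(s,a) \;=\; \sum_{s'} P_a(s,s') \,\bigl( R_a(s,s') + \beta \, V(s') \bigr)
```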

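Create the `FeedbackQvalueMDP` object. The policy `optimum2` is not defined earlier in this lesson; we assume here that it is obtained by solving `second_mdp` with value iteration (the precision and the iteration cap below are illustrative).

In [9]:
optimum2 = second_mdp.ValueIteration(1e-5, 700)  # assumed call: (precision, max iterations)
F = second_mdp.GetQValue(optimum2)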

Then we print it

In [10]:
print(F)

Before drawing actions, we reset the random generator

In [11]:
F.ResetSeed()

We draw an action with the *EpsilonGreedy* principle in state 0 with epsilon=0.1 for a maximisation criterion

In [12]:
action = F.EpsilonGreedyMax(0, 0.1)
print(action)

We draw an action with the *SoftMax* principle in state 2

In [13]:
action = F.SoftMax(2)
print(action)
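
The two drawing rules can be sketched in plain Python as follows. This is a textbook version under stated assumptions, not Marmote's code: the Q-values in `q_row` are hypothetical numbers, and the `temperature` parameter of the softmax rule is an assumption (the lesson does not say how `SoftMax` is parameterised).

```python
import math
import random


def epsilon_greedy_max(q_row, epsilon, rng):
    """With probability epsilon draw a uniform action, otherwise the argmax action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def softmax_draw(q_row, rng, temperature=1.0):
    """Draw action a with probability proportional to exp(Q(a) / temperature)."""
    q_max = max(q_row)  # subtract the maximum for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_row]
    return rng.choices(range(len(q_row)), weights=weights, k=1)[0]


# Hypothetical Q-values of one state; the last action is a penalised padded action.
q_row = [1.0, 3.0, -100000.0]

rng = random.Random(0)  # fixed seed so that the draws are reproducible
greedy_action = epsilon_greedy_max(q_row, 0.0, rng)  # epsilon = 0: pure argmax
explored = [epsilon_greedy_max(q_row, 0.1, rng) for _ in range(1000)]
soft = [softmax_draw(q_row, rng) for _ in range(1000)]
print(greedy_action, explored.count(1), soft.count(1))
```

Note the difference between the two rules on the penalised action: epsilon-greedy can still select it during a uniform exploration step, whereas softmax gives it a weight that underflows to zero.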