## Finite MDP in Python
**Author:** Lauren Washington

In [16]:
import gym
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mdptoolbox
import mdptoolbox.example
from gym import wrappers

%matplotlib inline

## Environment Exploration

In [17]:
# create gym environment
env = gym.envs.make("CartPole-v0")

In [18]:
env.spec.max_episode_steps

200

In [19]:
env.spec.timestep_limit

200

In [20]:
env.action_space.n

2

In [21]:
env.observation_space

Box(4,)

## Model Exploration

The `mdptoolbox.mdp.FiniteHorizon` class solves an MDP using the finite-horizon backwards induction algorithm.
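As a sketch of what backwards induction computes, the recursion can be written as below. The notation is an assumption made for illustration and does not come from the toolbox documentation: $\gamma$ is the discount, $R(s, a)$ the reward, $P(s' \mid s, a)$ the transition probability, $h$ the terminal reward, and $\pi_n$ the policy at stage $n$.

```latex
\begin{aligned}
V_N(s) &= h(s), \\
V_n(s) &= \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_{n+1}(s') \Big],
  \qquad n = N-1, \dots, 0, \\
\pi_n(s) &= \arg\max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_{n+1}(s') \Big].
\end{aligned}
```

The recursion starts at the terminal stage and works back to stage 0, which is why the value function has N+1 columns.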

Parameters:

**transitions (array)** – Transition probability matrices. See the documentation for the MDP class for details.  
**reward (array)** – Reward matrices or vectors. See the documentation for the MDP class for details.  
**discount (float)** – Discount factor. See the documentation for the MDP class for details.  
**N (int)** – Number of periods. Must be greater than 0.  
**h (array, optional)** – Terminal reward. Default: a vector of zeros.  
**skip_check (bool)** – By default we run a check on the transitions and rewards arguments to make sure they describe a valid MDP. You can set this argument to True in order to skip this check.  

Data attributes:

**V (array)** – Optimal value function. Shape = (S, N+1). V[:, n] = optimal value function at stage n with stage in {0, 1...N-1}. V[:, N] = value function for the terminal stage.  
**policy (array)** – Optimal policy. Shape = (S, N), as the output below shows for N = 3. policy[:, n] = optimal policy at stage n with stage in {0, 1...N-1}.  
**time (float)** – CPU time used.  
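The following is a minimal NumPy sketch of finite-horizon backward induction under the parameter conventions listed above. It is not the toolbox's implementation: the function name `backward_induction` is hypothetical, and the `P` and `R` arrays are copied from the forest example printed later in this notebook.

```python
import numpy as np

def backward_induction(P, R, discount, N, h=None):
    """Hypothetical helper: finite-horizon backward induction.

    P: array of shape (A, S, S), P[a, s, s2] = Pr(s2 | s, a)
    R: array of shape (S, A), R[s, a] = reward for action a in state s
    Returns V with shape (S, N + 1) and policy with shape (S, N).
    """
    S = P.shape[1]
    V = np.zeros((S, N + 1))
    if h is not None:
        V[:, N] = h  # terminal reward; defaults to zeros
    policy = np.zeros((S, N), dtype=int)
    # Work backwards from the last decision stage to stage 0.
    for n in range(N - 1, -1, -1):
        # Q[s, a] = R[s, a] + discount * sum_s2 P[a, s, s2] * V[s2, n + 1]
        Q = R + discount * (P @ V[:, n + 1]).T
        policy[:, n] = Q.argmax(axis=1)  # ties go to the lowest action index
        V[:, n] = Q.max(axis=1)
    return V, policy

# Transition and reward arrays copied from the forest example output.
P = np.array([
    [[0.1, 0.9, 0.0, 0.0],
     [0.1, 0.0, 0.9, 0.0],
     [0.1, 0.0, 0.0, 0.9],
     [0.1, 0.0, 0.0, 0.9]],
    [[1.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0]],
])
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 1.0], [2.0, 2.0]])

V, policy = backward_induction(P, R, discount=0.9, N=3)
print(V)
print(policy)
```

With a discount of 0.9 and N = 3, this sketch should produce the same `V` and `policy` arrays that `FiniteHorizon` prints at the end of this notebook.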

In [22]:
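# P = transitions, indexed as P[action, state, next_state]
# forest(4, 2) builds the toolbox's forest-management example with
# S=4 states and r1=2; action 0 is "wait" and action 1 is "cut".
# Example: waiting in S1 has a 10% probability of staying in S1
# and a 90% probability of moving to S2.
# next state:         S1    S2    S3    S4
# P[0] (wait) = [[  0.1   0.9   0.    0. ]    from S1
#                [  0.1   0.    0.9   0. ]    from S2
#                [  0.1   0.    0.    0.9]    from S3
#                [  0.1   0.    0.    0.9]]   from S4

# next state:         S1    S2    S3    S4
# P[1] (cut)  = [[  1.    0.    0.    0. ]    from S1
#                [  1.    0.    0.    0. ]    from S2
#                [  1.    0.    0.    0. ]    from S3
#                [  1.    0.    0.    0. ]]   from S4

# R = rewards, indexed as R[state, action]
# waiting pays 0 except in S4, where it pays 2
# cutting pays 0 in S1, 1 in S2 and S3, and 2 in S4
#   wait cut
# [[ 0.  0.]    S1
#  [ 0.  1.]    S2
#  [ 0.  1.]    S3
#  [ 2.  2.]]   S4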
P, R = mdptoolbox.example.forest(4,2)
print(P)
print(R)

[[[ 0.1  0.9  0.   0. ]
  [ 0.1  0.   0.9  0. ]
  [ 0.1  0.   0.   0.9]
  [ 0.1  0.   0.   0.9]]

 [[ 1.   0.   0.   0. ]
  [ 1.   0.   0.   0. ]
  [ 1.   0.   0.   0. ]
  [ 1.   0.   0.   0. ]]]
[[ 0.  0.]
 [ 0.  1.]
 [ 0.  1.]
 [ 2.  2.]]
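As a quick sanity check on the printed arrays, the sketch below confirms that every row of each action's transition matrix is a probability distribution and shows how a single entry is read. The arrays are copied from the output above; the variable names `is_stochastic` and `p_next` are illustrative only.

```python
import numpy as np

# Arrays copied from the printed output of mdptoolbox.example.forest(4, 2).
P = np.array([
    [[0.1, 0.9, 0.0, 0.0],
     [0.1, 0.0, 0.9, 0.0],
     [0.1, 0.0, 0.0, 0.9],
     [0.1, 0.0, 0.0, 0.9]],
    [[1.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0]],
])
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 1.0], [2.0, 2.0]])

n_actions, n_states, _ = P.shape  # P is indexed as P[action, state, next_state]

# Each row must sum to 1 for the matrices to describe a valid MDP.
is_stochastic = np.allclose(P.sum(axis=2), 1.0)

# Probability of moving from the first state to the second under action 0.
p_next = P[0, 0, 1]

print(n_actions, n_states, is_stochastic, p_next)
```

The same layout explains the shape of `R`: one row per state and one column per action.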


In [23]:
fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3)
fh.run()

In [24]:
print(fh.V)
print(fh.policy)

[[ 0.8829  0.81    0.      0.    ]
 [ 1.729   1.      1.      0.    ]
 [ 3.0051  1.62    1.      0.    ]
 [ 5.0051  3.62    2.      0.    ]]
[[0 0 0]
 [1 1 1]
 [0 0 1]
 [0 0 0]]
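To make the policy array easier to read, the sketch below maps each action index to a label. The labels "wait" and "cut" follow the toolbox's description of the forest example (action 0 and action 1); the `ACTION_NAMES` dictionary and the `decoded` list are illustrative names, and the array is copied from the output above.

```python
import numpy as np

# Policy copied from the printed fh.policy: rows are states, columns are stages.
policy = np.array([
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
])

# Assumed labels for the forest example: action 0 = wait, action 1 = cut.
ACTION_NAMES = {0: "wait", 1: "cut"}

# decoded[s][n] is the action label for state s at stage n.
decoded = [[ACTION_NAMES[a] for a in row] for row in policy.tolist()]

for state, actions in enumerate(decoded, start=1):
    print(f"S{state}: {actions}")
```

Read this way, the third state is left to grow in the first two stages and is cut only in the final stage, when there is no future value left to wait for.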
