# Example usage

## Install the package with pip

In [1]:
%%capture
# Install required packages
! pip install git+https://github.com/CassandraDurr/value_iteration.git

## Import functions from the package
In this example we solve the same value iteration problem in three ways:
- The first example passes Python objects to the value iteration algorithm and solves the problem synchronously.
- The second example uses the same inputs as the first, but solves the problem asynchronously.
- The last example loads the same data from a `csv` file and solves the problem with synchronous value iteration.

In [2]:
from value_iteration import ValueIteration, AsynchValueIteration, load_mdp_from_csv, MDP

### Synchronous value iteration with Python inputs

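For **synchronous value iteration** the update equations are:
$$
\begin{aligned}
Q(s, a) &= \sum_{s'} P(s' | s, a) \left[ R(s, a, s') + \gamma V(s') \right] \\
V(s) &= \max_a Q(s, a)
\end{aligned}
$$
Here $P(s' | s, a)$ is the transition probability, $R(s, a, s')$ is the reward and $\gamma$ is the discount factor. In each iteration, every $V(s)$ is recomputed from the value estimates of the previous iteration.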
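
The update above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the package's implementation: the helper name `sync_value_iteration` is hypothetical, the values are assumed to start at zero, and the loop is assumed to stop once the largest value change in a sweep falls below `theta`. The small MDP is the healthy/sick example used in this notebook.

```python
# Illustrative sketch of synchronous value iteration (not the package's code).
# `sync_value_iteration` is a hypothetical helper name.

S = ["healthy", "sick"]
A = {"healthy": ["relax", "party"], "sick": ["relax", "party"]}
P = {
    ("healthy", "relax", "healthy"): 0.95, ("healthy", "relax", "sick"): 0.05,
    ("sick", "relax", "healthy"): 0.5, ("sick", "relax", "sick"): 0.5,
    ("healthy", "party", "healthy"): 0.7, ("healthy", "party", "sick"): 0.3,
    ("sick", "party", "healthy"): 0.1, ("sick", "party", "sick"): 0.9,
}
R = {
    ("healthy", "relax", "healthy"): 7, ("healthy", "relax", "sick"): 7,
    ("healthy", "party", "healthy"): 10, ("healthy", "party", "sick"): 10,
    ("sick", "relax", "healthy"): 0, ("sick", "relax", "sick"): 0,
    ("sick", "party", "healthy"): 2, ("sick", "party", "sick"): 2,
}


def sync_value_iteration(S, A, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in S}  # assumed initialisation: all values start at zero
    while True:
        # Q(s, a) for every pair, computed only from the previous sweep's V
        Q = {
            (s, a): sum(
                P.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                for s2 in S
            )
            for s in S
            for a in A[s]
        }
        # V(s) = max_a Q(s, a), built as a new dict so old values stay untouched
        V_new = {s: max(Q[(s, a)] for a in A[s]) for s in S}
        delta = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if delta < theta:  # assumed stopping rule
            break
    # Greedy policy with respect to the last computed Q table
    policy = {s: max(A[s], key=lambda a: Q[(s, a)]) for s in S}
    return V, policy


values, policy = sync_value_iteration(S, A, P, R)
print("Values:", values)
print("Policy:", policy)
```

Building `V_new` as a separate dictionary is what makes the sweep synchronous: no state sees a value that was updated earlier in the same sweep.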

In [3]:
# Example 1: Create Markov Decision Process (MDP) inputs

# Define state space
S = ["healthy", "sick"]

# Define which actions ("relax" and "party") can occur at each state ("healthy" and "sick")
A = {
    "healthy": ["relax", "party"],
    "sick": ["relax", "party"],
}

# Define the transition probabilities. The key is in the form (current state, action, next state).
P = {
    ("healthy", "relax", "healthy"): 0.95,
    ("healthy", "relax", "sick"): 0.05,
    ("sick", "relax", "healthy"): 0.5,
    ("sick", "relax", "sick"): 0.5,
    ("healthy", "party", "healthy"): 0.7,
    ("healthy", "party", "sick"): 0.3,
    ("sick", "party", "healthy"): 0.1,
    ("sick", "party", "sick"): 0.9,
}

# Define the reward function. The key is in the form (current state, action, next state).
# In this example the reward depends only on the current state and the action,
# so entries that differ only in the next state share the same value.
# The next state is part of the key because, in general, the reward may depend on it.
R = {
    ("healthy", "relax", "healthy"): 7,
    ("healthy", "relax", "sick"): 7,
    ("healthy", "party", "healthy"): 10,
    ("healthy", "party", "sick"): 10,
    ("sick", "relax", "healthy"): 0,
    ("sick", "relax", "sick"): 0,
    ("sick", "party", "healthy"): 2,
    ("sick", "party", "sick"): 2,
}

# Create MDP data class
mdp = MDP(states=S, actions=A, probabilities=P, rewards=R)

# Set up value iteration class (synchronous version)
value_itr = ValueIteration(mdp=mdp, gamma=0.9, theta=1e-6, printing=True)

# Run value iteration algorithm
optimal_values, optimal_policy = value_itr.value_iteration()

# Display results
print("Optimal State Values:", optimal_values)
print("Optimal Policy:", optimal_policy)

Iteration 1, max value change: 10.0
Iteration 2, max value change: 6.84
Iteration 3, max value change: 5.227199999999996
Iteration 4, max value change: 4.537295999999998
Iteration 5, max value change: 4.053473279999995
Iteration 6, max value change: 3.642709190399998
Iteration 7, max value change: 3.277463254272007
Iteration 8, max value change: 2.949541425768956
Iteration 9, max value change: 2.654555692638418
Iteration 10, max value change: 2.3890944370749168
Iteration 11, max value change: 2.150183969833492
Iteration 12, max value change: 1.935165388614024
Iteration 13, max value change: 1.7416488165901214
Iteration 14, max value change: 1.5674839289618632
Iteration 15, max value change: 1.4107355349912183
Iteration 16, max value change: 1.2696619812986825
Iteration 17, max value change: 1.142695783134009
Iteration 18, max value change: 1.0284262048143304
Iteration 19, max value change: 0.9255835843317755
Iteration 20, max value change: 0.8330252258983961
Iteration 21, max value cha

### Asynchronous value iteration with Python inputs

For **asynchronous value iteration** the update equation is:
$$
Q(s, a) = \sum_{s'} P(s' | s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q(s', a') \right]
$$
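
One way to read the equation above is as an in-place update of a single Q table, where each new $Q(s, a)$ is immediately visible to later updates in the same sweep. The sketch below illustrates that idea under stated assumptions: the helper name `async_q_iteration` is hypothetical, the random sweep order and the max-change stopping rule are choices made for this example, and the package's own update order and its "Average delta" statistic are not reproduced.

```python
import random

# Illustrative sketch of an asynchronous (in-place) Q update.
# `async_q_iteration` is a hypothetical helper, not part of the package.

S = ["healthy", "sick"]
A = {"healthy": ["relax", "party"], "sick": ["relax", "party"]}
P = {
    ("healthy", "relax", "healthy"): 0.95, ("healthy", "relax", "sick"): 0.05,
    ("sick", "relax", "healthy"): 0.5, ("sick", "relax", "sick"): 0.5,
    ("healthy", "party", "healthy"): 0.7, ("healthy", "party", "sick"): 0.3,
    ("sick", "party", "healthy"): 0.1, ("sick", "party", "sick"): 0.9,
}
R = {
    ("healthy", "relax", "healthy"): 7, ("healthy", "relax", "sick"): 7,
    ("healthy", "party", "healthy"): 10, ("healthy", "party", "sick"): 10,
    ("sick", "relax", "healthy"): 0, ("sick", "relax", "sick"): 0,
    ("sick", "party", "healthy"): 2, ("sick", "party", "sick"): 2,
}


def async_q_iteration(S, A, P, R, gamma=0.9, theta=1e-6, seed=0):
    rng = random.Random(seed)  # fixed seed so the sweep order is reproducible
    Q = {(s, a): 0.0 for s in S for a in A[s]}
    pairs = list(Q)
    while True:
        rng.shuffle(pairs)  # assumed: visit state-action pairs in random order
        delta = 0.0
        for s, a in pairs:
            new_q = sum(
                P.get((s, a, s2), 0.0)
                * (R.get((s, a, s2), 0.0)
                   + gamma * max(Q[(s2, a2)] for a2 in A[s2]))
                for s2 in S
            )
            delta = max(delta, abs(new_q - Q[(s, a)]))
            Q[(s, a)] = new_q  # written in place, used by later updates
        if delta < theta:  # assumed stopping rule: largest change in a sweep
            break
    V = {s: max(Q[(s, a)] for a in A[s]) for s in S}
    policy = {s: max(A[s], key=lambda a: Q[(s, a)]) for s in S}
    return V, policy


async_values, async_policy = async_q_iteration(S, A, P, R)
print("Values:", async_values)
print("Policy:", async_policy)
```

Because both variants apply the same Bellman backup, this sketch is expected to reach the same fixed point as a synchronous sweep; only the order in which estimates are refreshed differs.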


In [4]:
# Example 2: Reuse the MDP inputs (S, A, P, R) defined in Example 1

# Create MDP data class
mdp = MDP(states=S, actions=A, probabilities=P, rewards=R)

# Set up value iteration class (asynchronous version)
value_itr = AsynchValueIteration(mdp=mdp, gamma=0.9, theta=1e-6, printing=True)

# Run value iteration algorithm
optimal_values, optimal_policy = value_itr.value_iteration()

# Display results
print("Optimal State Values:", optimal_values)
print("Optimal Policy:", optimal_policy)

Iteration 10, Average delta: 8.015620463763986
Iteration 11, Average delta: 7.015620463763986
Iteration 12, Average delta: 7.473941947625209
Iteration 13, Average delta: 5.654012872377015
Iteration 14, Average delta: 5.324789456252878
Iteration 15, Average delta: 4.643024230690085
Iteration 16, Average delta: 4.017718550333898
Iteration 17, Average delta: 4.639517027609466
Iteration 18, Average delta: 6.64658615924116
Iteration 19, Average delta: 6.9233716732041986
Iteration 20, Average delta: 6.94797713782655
Iteration 21, Average delta: 7.0474025723400855
Iteration 22, Average delta: 6.695214811536337
Iteration 23, Average delta: 6.466235332315624
Iteration 24, Average delta: 5.873449898928752
Iteration 25, Average delta: 5.664385977267812
Iteration 26, Average delta: 5.771510186244635
Iteration 27, Average delta: 4.853691490365113
Iteration 28, Average delta: 2.220393042620095
Iteration 29, Average delta: 1.6803457900587535
Iteration 30, Average delta: 1.4811754158078134
Iteration 3

### Synchronous value iteration with csv input
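
The layout of `example_data/transitions.csv` is not shown in this notebook. As a rough illustration of how a transitions table could map onto `S`, `A`, `P` and `R`, the sketch below parses a CSV with one row per (state, action, next state) triple. The column names (`state`, `action`, `next_state`, `probability`, `reward`) and the helper `parse_transitions` are assumptions made for this example; the file format expected by `load_mdp_from_csv` may differ.

```python
import csv
import io

# Hypothetical CSV layout: one row per (state, action, next_state) triple.
# These column names are assumed for illustration only.
csv_text = """state,action,next_state,probability,reward
healthy,relax,healthy,0.95,7
healthy,relax,sick,0.05,7
healthy,party,healthy,0.7,10
healthy,party,sick,0.3,10
sick,relax,healthy,0.5,0
sick,relax,sick,0.5,0
sick,party,healthy,0.1,2
sick,party,sick,0.9,2
"""


def parse_transitions(text):
    """Build states, actions, probabilities and rewards from CSV text."""
    S, A, P, R = [], {}, {}, {}
    for row in csv.DictReader(io.StringIO(text)):
        s, a, s2 = row["state"], row["action"], row["next_state"]
        if s not in S:
            S.append(s)  # keep states in order of first appearance
        actions = A.setdefault(s, [])
        if a not in actions:
            actions.append(a)
        P[(s, a, s2)] = float(row["probability"])
        R[(s, a, s2)] = float(row["reward"])
    return S, A, P, R


S_csv, A_csv, P_csv, R_csv = parse_transitions(csv_text)

# Sanity check: outgoing probabilities of each (state, action) pair sum to 1
prob_sums = {
    (s, a): sum(p for (s1, a1, _), p in P_csv.items() if (s1, a1) == (s, a))
    for s in S_csv
    for a in A_csv[s]
}
print(S_csv, A_csv)
print(prob_sums)
```

Checking that the probabilities of each state-action pair sum to one is a cheap way to catch a missing or mistyped row before running value iteration.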

In [5]:
# Example 3: Loading data from csv files

# Obtain states, actions, transition probabilities and rewards
S, A, P, R = load_mdp_from_csv(transitions_filepath="example_data/transitions.csv")

# Create MDP data class
mdp = MDP(states=S, actions=A, probabilities=P, rewards=R)

# Set up value iteration class (synchronous version)
value_itr = ValueIteration(mdp=mdp, gamma=0.9, theta=1e-6, printing=True)

# Run value iteration algorithm
optimal_values, optimal_policy = value_itr.value_iteration()

# Display results
print("Optimal State Values:", optimal_values)
print("Optimal Policy:", optimal_policy)

Iteration 1, max value change: 10.0
Iteration 2, max value change: 6.84
Iteration 3, max value change: 5.227199999999996
Iteration 4, max value change: 4.537295999999998
Iteration 5, max value change: 4.053473279999995
Iteration 6, max value change: 3.642709190399998
Iteration 7, max value change: 3.277463254272007
Iteration 8, max value change: 2.949541425768956
Iteration 9, max value change: 2.654555692638418
Iteration 10, max value change: 2.3890944370749168
Iteration 11, max value change: 2.150183969833492
Iteration 12, max value change: 1.935165388614024
Iteration 13, max value change: 1.7416488165901214
Iteration 14, max value change: 1.5674839289618632
Iteration 15, max value change: 1.4107355349912183
Iteration 16, max value change: 1.2696619812986825
Iteration 17, max value change: 1.142695783134009
Iteration 18, max value change: 1.0284262048143304
Iteration 19, max value change: 0.9255835843317755
Iteration 20, max value change: 0.8330252258983961
Iteration 21, max value cha