# Example usage

This document shows how to use the functions of this package on a simple example described on [this website](https://artint.info/2e/html2e/ArtInt2e.Ch9.S5.SS2.html). The example is reproduced here for convenience:

Sam wants to decide whether to party or relax over the weekend and has a preference for partying, although she is worried about getting sick. 

We can model this problem as a Markov Decision Process (MDP) and solve it using value iteration.
In this document we run the same value iteration in three ways:
- In the first example, we provide Python inputs to the value iteration algorithm and solve the problem synchronously.
- The second example follows the same format as the first, but the problem is solved asynchronously.
- In the last example, we provide the same data in `csv` format and solve the problem with synchronous value iteration. This shows how a user less familiar with Python can use the package by supplying a `csv` file instead of writing Python code to generate the inputs.

## Install the package with pip

First, we install the package so that it can be used.

In [1]:
%%capture
# Install required packages
! pip install git+https://github.com/CassandraDurr/value_iteration.git

## Import functions from the package

Next, we import the classes and functions that we are going to showcase.

In [2]:
from value_iteration import ValueIteration, AsynchValueIteration, load_mdp_from_csv, MDP

### Synchronous value iteration with Python inputs

For **synchronous value iteration**, the update equations are:
$$
\begin{aligned}
Q(s, a) &= \sum_{s'} P(s' | s, a) \left[ R(s, a, s') + \gamma V(s') \right] \\
V(s) &= \max_a Q(s, a)
\end{aligned}
$$
where $s$ represents the current state, $a$ represents an action, and $s'$ represents the next state (reached by taking action $a$ in state $s$). $P(s' | s, a)$ is the probability of transitioning to state $s'$ when taking action $a$ in state $s$, $R(s, a, s')$ is the reward associated with the current state, action and next state, and $\gamma$ is the discount factor.
The full algorithm is given in Section 9.5.2 of this [website](https://artint.info/2e/html2e/ArtInt2e.Ch9.S5.SS2.html).
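
To make the update concrete, here is a minimal standalone sketch of synchronous value iteration on plain dictionaries. It is not the package's implementation: the function name `synchronous_value_iteration` and the helper dictionary `R_sa` are hypothetical, and the stopping rule (largest value change below `theta`) is an assumption chosen to mirror the `theta` argument used in this document.

```python
# Standalone sketch of synchronous value iteration (NOT the package's code).
S = ["healthy", "sick"]
A = {"healthy": ["relax", "party"], "sick": ["relax", "party"]}
P = {
    ("healthy", "relax", "healthy"): 0.95, ("healthy", "relax", "sick"): 0.05,
    ("sick", "relax", "healthy"): 0.5, ("sick", "relax", "sick"): 0.5,
    ("healthy", "party", "healthy"): 0.7, ("healthy", "party", "sick"): 0.3,
    ("sick", "party", "healthy"): 0.1, ("sick", "party", "sick"): 0.9,
}
# In this example the reward depends only on (state, action),
# so we expand it to the (state, action, next state) form.
R_sa = {("healthy", "relax"): 7, ("healthy", "party"): 10,
        ("sick", "relax"): 0, ("sick", "party"): 2}
R = {(s, a, s2): R_sa[(s, a)] for (s, a, s2) in P}


def synchronous_value_iteration(S, A, P, R, gamma=0.9, theta=1e-6):
    """Hypothetical helper: every V(s) is updated from the previous sweep's V."""
    V = {s: 0.0 for s in S}
    while True:
        # Q(s, a) = sum_s' P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
        Q = {
            (s, a): sum(
                P.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                for s2 in S
            )
            for s in S
            for a in A[s]
        }
        # V(s) = max_a Q(s, a), computed for all states before V is replaced
        V_new = {s: max(Q[(s, a)] for a in A[s]) for s in S}
        delta = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if delta < theta:
            break
    policy = {s: max(A[s], key=lambda a: Q[(s, a)]) for s in S}
    return V, policy


V, policy = synchronous_value_iteration(S, A, P, R)
print("Values:", V)
print("Policy:", policy)
```

The key point of the synchronous variant is that `V_new` is built entirely from the old `V`, and only then swapped in.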

In [3]:
# Example 1: Synchronous Value Iteration

# Create Markov Decision Process (MDP) inputs

# Define state space
S = ["healthy", "sick"]

# Define which actions ("relax" and "party") can occur at each state ("healthy" and "sick")
A = {
    "healthy": ["relax", "party"],
    "sick": ["relax", "party"],
}

# Define the transition probabilities. The key is in the form (current state, action, next state).
P = {
    ("healthy", "relax", "healthy"): 0.95,
    ("healthy", "relax", "sick"): 0.05,
    ("sick", "relax", "healthy"): 0.5,
    ("sick", "relax", "sick"): 0.5,
    ("healthy", "party", "healthy"): 0.7,
    ("healthy", "party", "sick"): 0.3,
    ("sick", "party", "healthy"): 0.1,
    ("sick", "party", "sick"): 0.9,
}

# Define the reward function. The key is in the form (current state, action, next state).
# For this example the reward is only dependent on the current state and the action.
# Therefore, there are duplicates.
# Sometimes the reward is also dependent on the next state and so we need to define it in this way.
R = {
    ("healthy", "relax", "healthy"): 7,
    ("healthy", "relax", "sick"): 7,
    ("healthy", "party", "healthy"): 10,
    ("healthy", "party", "sick"): 10,
    ("sick", "relax", "healthy"): 0,
    ("sick", "relax", "sick"): 0,
    ("sick", "party", "healthy"): 2,
    ("sick", "party", "sick"): 2,
}

# Create MDP data class
mdp = MDP(states=S, actions=A, probabilities=P, rewards=R)

# Setup value iteration class (synchronous version)
value_itr = ValueIteration(mdp=mdp, gamma=0.9, theta=1e-6, printing=True)

# Run value iteration algorithm
optimal_values, optimal_policy = value_itr.value_iteration()

# Display results
print("Optimal State Values:", optimal_values)
print("Optimal Policy:", optimal_policy)

Iteration 1, max value change: 10.0
Iteration 2, max value change: 6.84
Iteration 3, max value change: 5.227199999999996
Iteration 4, max value change: 4.537295999999998
Iteration 5, max value change: 4.053473279999995
Iteration 6, max value change: 3.642709190399998
Iteration 7, max value change: 3.277463254272007
Iteration 8, max value change: 2.949541425768956
Iteration 9, max value change: 2.654555692638418
Iteration 10, max value change: 2.3890944370749168
Iteration 11, max value change: 2.150183969833492
Iteration 12, max value change: 1.935165388614024
Iteration 13, max value change: 1.7416488165901214
Iteration 14, max value change: 1.5674839289618632
Iteration 15, max value change: 1.4107355349912183
Iteration 16, max value change: 1.2696619812986825
Iteration 17, max value change: 1.142695783134009
Iteration 18, max value change: 1.0284262048143304
Iteration 19, max value change: 0.9255835843317755
Iteration 20, max value change: 0.8330252258983961
[output truncated]
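
The transition dictionary `P` in Example 1 only describes a valid MDP if, for every state-action pair, the probabilities over next states sum to 1. The following is a minimal sketch of such a sanity check; the helper `check_transitions` is hypothetical and is not claimed to be part of the package.

```python
import math
from collections import defaultdict

P = {
    ("healthy", "relax", "healthy"): 0.95, ("healthy", "relax", "sick"): 0.05,
    ("sick", "relax", "healthy"): 0.5, ("sick", "relax", "sick"): 0.5,
    ("healthy", "party", "healthy"): 0.7, ("healthy", "party", "sick"): 0.3,
    ("sick", "party", "healthy"): 0.1, ("sick", "party", "sick"): 0.9,
}


def check_transitions(P, tol=1e-9):
    """Hypothetical helper: return the (state, action) pairs whose
    outgoing probabilities do not sum to 1 (within `tol`)."""
    totals = defaultdict(float)
    for (s, a, _next_state), p in P.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"Probability out of range for {(s, a, _next_state)}: {p}")
        totals[(s, a)] += p
    return {
        pair: total
        for pair, total in totals.items()
        if not math.isclose(total, 1.0, abs_tol=tol)
    }


invalid = check_transitions(P)
print("Invalid state-action pairs:", invalid)
```

An empty result means every state-action pair defines a proper probability distribution over next states.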

### Asynchronous value iteration with Python inputs

For **asynchronous value iteration**, the update equation is:
$$
Q(s, a) = \sum_{s'} P(s' | s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q(s', a') \right]
$$
using the same variable definitions as above. The full algorithm for asynchronous value iteration is detailed in Section 9.5.2 of this [website](https://artint.info/2e/html2e/ArtInt2e.Ch9.S5.SS2.html) after the synchronous version.
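
The equation above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the package's implementation: the function name `asynchronous_value_iteration` is hypothetical, and the sketch visits the state-action pairs in a fixed order and stops when a full pass changes no entry by more than `theta`. Other selection orders (for example random selection) are equally valid for asynchronous value iteration, and the package may use a different one.

```python
# Standalone sketch of asynchronous value iteration (NOT the package's code).
S = ["healthy", "sick"]
A = {"healthy": ["relax", "party"], "sick": ["relax", "party"]}
P = {
    ("healthy", "relax", "healthy"): 0.95, ("healthy", "relax", "sick"): 0.05,
    ("sick", "relax", "healthy"): 0.5, ("sick", "relax", "sick"): 0.5,
    ("healthy", "party", "healthy"): 0.7, ("healthy", "party", "sick"): 0.3,
    ("sick", "party", "healthy"): 0.1, ("sick", "party", "sick"): 0.9,
}
R_sa = {("healthy", "relax"): 7, ("healthy", "party"): 10,
        ("sick", "relax"): 0, ("sick", "party"): 2}
R = {(s, a, s2): R_sa[(s, a)] for (s, a, s2) in P}


def asynchronous_value_iteration(S, A, P, R, gamma=0.9, theta=1e-6):
    """Hypothetical helper: each Q(s, a) is overwritten in place, so later
    updates in the same pass already see the newest values."""
    Q = {(s, a): 0.0 for s in S for a in A[s]}
    while True:
        delta = 0.0
        for (s, a) in Q:
            # Q(s,a) = sum_s' P(s'|s,a) * [R(s,a,s') + gamma * max_a' Q(s',a')]
            new_q = sum(
                P.get((s, a, s2), 0.0)
                * (R.get((s, a, s2), 0.0) + gamma * max(Q[(s2, a2)] for a2 in A[s2]))
                for s2 in S
            )
            delta = max(delta, abs(new_q - Q[(s, a)]))
            Q[(s, a)] = new_q  # in-place update: no separate copy of the table
        if delta < theta:
            break
    policy = {s: max(A[s], key=lambda a: Q[(s, a)]) for s in S}
    return Q, policy


Q, policy = asynchronous_value_iteration(S, A, P, R)
print("Q-values:", Q)
print("Policy:", policy)
```

Compared with the synchronous version, only one table is stored and each update immediately affects the ones that follow.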

In [4]:
# Example 2: Asynchronous Value Iteration

# Create MDP data class
mdp = MDP(states=S, actions=A, probabilities=P, rewards=R)

# Setup value iteration class (asynchronous version)
value_itr = AsynchValueIteration(mdp=mdp, gamma=0.9, theta=1e-6, printing=True)

# Run value iteration algorithm
optimal_values, optimal_policy = value_itr.value_iteration()

# Display results
print("Optimal State Values:", optimal_values)
print("Optimal Policy:", optimal_policy)

Iteration 10, Average delta: 7.720360315245183
Iteration 11, Average delta: 7.524991770796127
Iteration 12, Average delta: 7.150101012658669
Iteration 13, Average delta: 6.44591983503207
Iteration 14, Average delta: 5.753595693127314
Iteration 15, Average delta: 5.225455083727316
Iteration 16, Average delta: 4.892726499805318
Iteration 17, Average delta: 4.599517888724458
Iteration 18, Average delta: 1.5984385050792924
Iteration 19, Average delta: 1.8667483465051107
Iteration 20, Average delta: 3.4214758108691257
Iteration 21, Average delta: 3.079605651773414
Iteration 22, Average delta: 3.555130618561811
Iteration 23, Average delta: 4.045704432570971
Iteration 24, Average delta: 4.354765935396741
Iteration 25, Average delta: 4.625720963384205
Iteration 26, Average delta: 5.194564444447569
Iteration 27, Average delta: 5.5264262868967
Iteration 28, Average delta: 5.859947623158648
Iteration 29, Average delta: 7.659181816375513
Iteration 30, Average delta: 5.9340352713848175
[output truncated]

### Synchronous value iteration with csv input

Lastly, we show how to use this package without writing Python code to create the MDP inputs. This example uses a `csv` file to describe the MDP. The file needs the columns `state`, `action`, `next_state`, `probability`, and `reward`. Each row describes the transition probability and reward associated with a triplet $(s, a, s')$ (i.e. current state, action taken, next state). The example file is in the example data folder of the examples (`example_data/transitions.csv`).
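
To illustrate the expected layout, the sketch below builds a `csv` with those five columns and parses it into the same `S`, `A`, `P`, `R` structures used in the earlier examples. The rows are an assumption that mirrors the dictionaries of Example 1 (the actual contents of the example file are not reproduced here), and `parse_mdp_csv` is a hypothetical helper, not the package's `load_mdp_from_csv`.

```python
import csv
import io

# Assumed file contents, mirroring the P and R dictionaries of Example 1.
csv_text = """state,action,next_state,probability,reward
healthy,relax,healthy,0.95,7
healthy,relax,sick,0.05,7
healthy,party,healthy,0.7,10
healthy,party,sick,0.3,10
sick,relax,healthy,0.5,0
sick,relax,sick,0.5,0
sick,party,healthy,0.1,2
sick,party,sick,0.9,2
"""


def parse_mdp_csv(text):
    """Hypothetical helper: turn csv text into states, actions,
    transition probabilities and rewards."""
    S, A, P, R = [], {}, {}, {}
    for row in csv.DictReader(io.StringIO(text)):
        s, a, s2 = row["state"], row["action"], row["next_state"]
        # Collect states in order of first appearance
        for state in (s, s2):
            if state not in S:
                S.append(state)
        # Collect the actions available in each state
        A.setdefault(s, [])
        if a not in A[s]:
            A[s].append(a)
        P[(s, a, s2)] = float(row["probability"])
        R[(s, a, s2)] = float(row["reward"])
    return S, A, P, R


S, A, P, R = parse_mdp_csv(csv_text)
print(S)
print(A)
```

Reading from an in-memory string keeps the sketch self-contained; with a real file, `open(path, newline="")` would take the place of `io.StringIO(text)`.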

In [5]:
# Example 3: Loading data from csv files

# Obtain states, actions, transition probabilities and rewards
S, A, P, R = load_mdp_from_csv(transitions_filepath="example_data/transitions.csv")

# Create MDP data class
mdp = MDP(states=S, actions=A, probabilities=P, rewards=R)

# Setup value iteration class (synchronous version)
value_itr = ValueIteration(mdp=mdp, gamma=0.9, theta=1e-6, printing=True)

# Run value iteration algorithm
optimal_values, optimal_policy = value_itr.value_iteration()

# Display results
print("Optimal State Values:", optimal_values)
print("Optimal Policy:", optimal_policy)

Iteration 1, max value change: 10.0
Iteration 2, max value change: 6.84
Iteration 3, max value change: 5.227199999999996
Iteration 4, max value change: 4.537295999999998
Iteration 5, max value change: 4.053473279999995
Iteration 6, max value change: 3.642709190399998
Iteration 7, max value change: 3.277463254272007
Iteration 8, max value change: 2.949541425768956
Iteration 9, max value change: 2.654555692638418
Iteration 10, max value change: 2.3890944370749168
Iteration 11, max value change: 2.150183969833492
Iteration 12, max value change: 1.935165388614024
Iteration 13, max value change: 1.7416488165901214
Iteration 14, max value change: 1.5674839289618632
Iteration 15, max value change: 1.4107355349912183
Iteration 16, max value change: 1.2696619812986825
Iteration 17, max value change: 1.142695783134009
Iteration 18, max value change: 1.0284262048143304
Iteration 19, max value change: 0.9255835843317755
Iteration 20, max value change: 0.8330252258983961
[output truncated]

All three examples arrive at the same optimal policy: it is better to party when you are healthy and relax when you are sick!
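
As a quick check of this conclusion, the values of the stated policy can be computed in closed form from its two Bellman equations and compared with the alternative action in each state. The sketch below assumes the probabilities and rewards of Example 1 and `gamma = 0.9`; the variable names are illustrative and not part of the package.

```python
gamma = 0.9

# Under the policy "party when healthy, relax when sick":
#   V_h = 10 + gamma * (0.7 * V_h + 0.3 * V_s)
#   V_s =  0 + gamma * (0.5 * V_h + 0.5 * V_s)
# The second equation gives V_s = ratio * V_h.
ratio = 0.5 * gamma / (1 - 0.5 * gamma)
# Substituting into the first equation and solving for V_h:
V_h = 10 / (1 - gamma * (0.7 + 0.3 * ratio))
V_s = ratio * V_h

# One-step lookahead for the actions the policy does NOT choose
q_healthy_relax = 7 + gamma * (0.95 * V_h + 0.05 * V_s)
q_sick_party = 2 + gamma * (0.1 * V_h + 0.9 * V_s)

print(f"healthy: party={V_h:.4f}, relax={q_healthy_relax:.4f}")
print(f"sick:    relax={V_s:.4f}, party={q_sick_party:.4f}")
```

In both states the action chosen by the policy has the larger value, so no single-action deviation improves on it, which is the condition for the policy to be optimal.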