# Lab Assignment 1: Understanding Markov Decision Processes and Value Iteration

## Objective
In this lab, you will explore the fundamentals of Markov Decision Processes (MDPs) using a transportation problem. You will implement and analyze the value iteration algorithm to find the optimal policy for reaching a destination.

## Background
The `TransportationMDP` class models a scenario where an agent moves from location 1 to location \(N\) (set to 27). The agent can either "walk" (move to \(state + 1\)) or take a "tram" (move to \(2 \times state\) with probability 0.5, or stay in the same state with probability 0.5). The tram is available only when \(2 \times state \le N\). Walking costs 1 unit, and each tram attempt also costs 1 unit, whether or not it succeeds. The goal is to minimize the expected total cost to reach \(N\).

## Tasks
1. **Run the Provided Code**
   - Execute the code with \(N = 27\). Observe the output of the `valueIteration` function.
   - Note: Comment out `os.system('clear')` if it doesn’t work on your system.

2. **Modify Parameters**
   - Change `failProb` to 0.2 and rerun. Record the policy for states 1, 5, 10, and 20.
   - Change `walkCost` to 2 and rerun with `failProb = 0.5`. Record the policy for the same states.

3. **Implement a Utility Function**
   - Add `printPolicyAndValues` to print a table of state, value, and action.

4. **Analyze Convergence**
   - Modify `valueIteration` to track iterations until convergence.

## Questions
1. What does the optimal policy suggest about the trade-off between walking and taking the tram when `failProb = 0.5`? How does this change when `failProb = 0.2`?
2. Why does increasing `walkCost` to 2 affect the policy in certain states?
3. How does the discount factor (currently 1.0) influence the solution?
4. Is the policy always deterministic in this MDP? Why or why not?

## Deliverables
- Submit your modified code.
- Provide a short report (1-2 pages) answering the questions.

In [1]:
import os

class TransportationMDP(object):
    walkCost = 1
    tramCost = 1
    failProb = 0.5

    def __init__(self, N):
        self.N = N

    def startState(self):
        return 1

    def isEnd(self, state):
        return state == self.N

    def actions(self, state):
        results = []
        if state + 1 <= self.N:
            results.append('walk')
        if 2 * state <= self.N:
            results.append('tram')
        return results

    def succProbReward(self, state, action):
        results = []
        if action == 'walk':
            results.append((state + 1, 1, -self.walkCost))
        elif action == 'tram':
            results.append((state, self.failProb, -self.tramCost))
            results.append((2 * state, 1 - self.failProb, -self.tramCost))
        return results

    def discount(self):
        return 1.0

    def states(self):
        return list(range(1, self.N + 1))

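def valueIteration(mdp):
    """Return (pi, V): the greedy policy and state values found by value iteration."""
    V = {}   # state -> current value estimate
    pi = {}  # state -> best action under the latest estimates

    def Q(state, action):
        # Expected return of taking `action` in `state`, then following V.
        return sum(prob * (reward + mdp.discount() * V.get(newState, 0))
                   for newState, prob, reward in mdp.succProbReward(state, action))

    for state in mdp.states():
        V[state] = 0

    while True:
        # One synchronous sweep: newV is built from the previous V only.
        newV = {}
        for state in mdp.states():
            if mdp.isEnd(state):
                newV[state], pi[state] = 0, None
            else:
                newV[state], pi[state] = max((Q(state, action), action) for action in mdp.actions(state))
            # print(f'{state}: {newV[state]} {pi[state]}')

        # Stop once no state's value changed by 1e-10 or more in this sweep.
        if max(abs(V[state] - newV[state]) for state in mdp.states()) < 1e-10:
            break
        V = newV
    return pi, V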

mdp = TransportationMDP(N=27)
valueIteration(mdp)

({1: 'walk',
  2: 'walk',
  3: 'tram',
  4: 'walk',
  5: 'walk',
  6: 'tram',
  7: 'walk',
  8: 'walk',
  9: 'walk',
  10: 'walk',
  11: 'walk',
  12: 'walk',
  13: 'tram',
  14: 'walk',
  15: 'walk',
  16: 'walk',
  17: 'walk',
  18: 'walk',
  19: 'walk',
  20: 'walk',
  21: 'walk',
  22: 'walk',
  23: 'walk',
  24: 'walk',
  25: 'walk',
  26: 'walk',
  27: None},
 {1: -9.999999999882107,
  2: -8.999999999938439,
  3: -7.9999999999678835,
  4: -7.99999999999477,
  5: -6.999999999997328,
  6: -5.999999999998636,
  7: -8.99999999999909,
  8: -7.999999999999545,
  9: -6.999999999999773,
  10: -5.999999999999886,
  11: -4.999999999999943,
  12: -3.9999999999999716,
  13: -2.999999999999986,
  14: -13.0,
  15: -12.0,
  16: -11.0,
  17: -10.0,
  18: -9.0,
  19: -8.0,
  20: -7.0,
  21: -6.0,
  22: -5.0,
  23: -4.0,
  24: -3.0,
  25: -2.0,
  26: -1.0,
  27: 0})

The initial output shows a policy that walks in most states and takes the tram only at states 3, 6, and 13.

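**Trade-off between walking and tram**

- With `failProb = 0.5`, a tram ride costs 2 units in expectation (1 unit per attempt, 2 attempts on average), so it is worth taking only where doubling lands on a state from which 27 is cheap to reach. The policy takes the tram at states 3, 6, and 13 and walks elsewhere. From state 14 onward, walking is the only available action because \(2 \times state > 27\).
- With `failProb = 0.2`, the expected tram cost drops to 1.25 units. Working the recursion by hand for \(N = 27\) gives the same tram states (3, 6, and 13), so states 1, 5, 10, and 20 still walk. What changes is the value: for example, the value of state 1 improves from about -10 to -7.75.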

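**Effect of increasing `walkCost` to 2**

- One walking step now costs as much as an expected tram ride (2 units each), so the tram is never worse than a single walking step that reaches the same state.
- For \(N = 27\) with `failProb = 0.5`, working the recursion by hand gives tram at states 3, 6, and 13 as before, and states 5, 10, and 20 still walk. The one change is state 1, where walk and tram both lead to state 2 at an expected cost of 2. The two actions tie there, so the reported action depends on tie-breaking and floating-point rounding.
- Doubling still does not help from states such as 4, 5, or 7 through 12, because it lands on a state from which the remaining walk to 27 is long.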
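The parameter experiments from Task 2 can be cross-checked without running value iteration. The sketch below is not the assignment's method: it swaps in exact backward induction, relying on the assumption that a failed tram ride leaves the state unchanged, so retrying until success costs `tramCost / (1 - failProb)` in expectation. The function name `solve_by_backward_induction` and the `settings` and `results` dictionaries are hypothetical names introduced here.

```python
from fractions import Fraction

def solve_by_backward_induction(N, walk_cost, tram_cost, fail_prob):
    """Exact policy and values for the transportation MDP (discount 1).

    Assumption: fail_prob < 1, so retrying the tram until it succeeds costs
    tram_cost / (1 - fail_prob) in expectation. Each state then depends only
    on higher-numbered states and can be solved once, from N down to 1.
    """
    walk_cost, tram_cost, fail_prob = (Fraction(x) for x in (walk_cost, tram_cost, fail_prob))
    expected_tram = tram_cost / (1 - fail_prob)
    V = {N: Fraction(0)}
    best_actions = {N: []}  # the end state has no action
    for s in range(N - 1, 0, -1):
        q = {'walk': V[s + 1] - walk_cost}
        if 2 * s <= N:
            q['tram'] = V[2 * s] - expected_tram
        V[s] = max(q.values())
        # Keep every action that attains the maximum so that ties stay visible.
        best_actions[s] = sorted(a for a, value in q.items() if value == V[s])
    return best_actions, V

# Probabilities are passed as strings so that Fraction parses them exactly.
settings = {
    'walkCost=1, failProb=0.5': (1, 1, '0.5'),
    'walkCost=1, failProb=0.2': (1, 1, '0.2'),
    'walkCost=2, failProb=0.5': (2, 1, '0.5'),
}
results = {}
for label, (walk_cost, tram_cost, fail_prob) in settings.items():
    best_actions, V = solve_by_backward_induction(27, walk_cost, tram_cost, fail_prob)
    tram_states = sorted(s for s, acts in best_actions.items() if 'tram' in acts)
    tied_states = sorted(s for s, acts in best_actions.items() if len(acts) > 1)
    results[label] = (tram_states, tied_states, float(V[1]))
    print(f'{label}: tram at {tram_states}, ties at {tied_states}, V[1] = {float(V[1])}')
```

For \(N = 27\) this reports tram at states 3, 6, and 13 for both failure probabilities, and additionally lists state 1 as a tie when `walkCost = 2`. Exact fractions are used so that a genuine tie is not hidden by rounding.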

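**Discount factor influence**

- The current discount of 1.0 means no discounting: a cost paid many steps from now counts exactly as much as a cost paid immediately, so each value is the negative of the expected total cost to reach \(N\).
- A discount below 1 would shrink the weight of later costs, which reduces the magnitude of every value and can change which action is preferred.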
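As a reference for where the discount enters, the update computed by the `Q` helper and the `max` in `valueIteration` can be written as the Bellman optimality equation. The symbols \(T\), \(R\), and \(\gamma\) are notation introduced here for the probability and reward returned by `succProbReward` and the value returned by `discount()`; they do not appear in the original code.

\[
V(s) = \max_{a \in \text{actions}(s)} \sum_{s'} T(s, a, s') \bigl[ R(s, a, s') + \gamma \, V(s') \bigr], \qquad V(N) = 0 .
\]

With \(\gamma = 1\) the bracket adds the future value at full weight. With \(\gamma < 1\), each further step is scaled by another factor of \(\gamma\).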


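**Policy determinism**

- The policy returned by `valueIteration` is always deterministic: for each state it stores the single action that attains the maximum. A finite MDP always has a deterministic optimal policy, so nothing is lost by this.
- With the default parameters, one action is strictly better than the other in every state where both are available, so the optimal policy is also unique.
- Ties are possible for other parameters. For example, with `walkCost = 2` and `failProb = 0.5`, walk and tram from state 1 both reach state 2 at an expected cost of 2. Either action is then optimal, and `max` over the `(value, action)` tuples picks one of them.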


This code models a simple transportation scenario as an MDP:

- You are at position `state` on a number line from 1 to `N`.
- Your goal is to reach state `N` as cheaply as possible.
- At each state, you can either:
  - **walk**: move from `state` to `state + 1`. This always succeeds and has a fixed cost.
  - **tram**: try to go from `state` to `2 * state`. With probability `failProb` you stay where you are and still pay the tram's cost; with probability `1 - failProb` you move to `2 * state`.
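Because a failed tram attempt leaves the state unchanged, the total cost of retrying the tram until it succeeds has a simple expectation. Writing \(c\) for `tramCost` and \(p\) for `failProb` (shorthand introduced here), and assuming \(p < 1\), the number of attempts is geometrically distributed:

\[
\mathbb{E}[\text{tram cost}] = c \sum_{k=1}^{\infty} k \, p^{k-1} (1 - p) = \frac{c}{1 - p} .
\]

With the defaults \(c = 1\) and \(p = 0.5\) this is 2 units; with \(p = 0.2\) it is 1.25 units.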



**Class attributes**

- `walkCost = 1`
- `tramCost = 1`
- `failProb = 0.5`

**Methods**

- `__init__(self, N)`: stores the maximum state `N`, which is the destination.
- `startState(self)`: returns the starting point on the line (1).
- `isEnd(self, state)`: returns whether `state` is the goal (state `N`).
- `actions(self, state)`: returns only the moves that are possible from `state`:
  - `'walk'` (to `state + 1`, if that does not exceed `N`)
  - `'tram'` (to `2 * state`, if that does not overshoot `N`)
- `succProbReward(self, state, action)`: for a given action, returns a list of `(newState, probability, reward)` tuples:
  - walk: deterministic, always goes to `state + 1` with reward `-walkCost`.
  - tram: with probability `failProb` (0.5 by default) stays in `state`; otherwise, with probability `1 - failProb`, goes to `2 * state`. The reward is `-tramCost` in both cases.
- `discount(self)`: returns 1.0, i.e., no discounting (future and current rewards are valued equally).
- `states(self)`: returns the list of all states, 1 through `N`.
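To make the shape of the transition tuples concrete, here is a minimal standalone sketch. The function `succ_prob_reward` is a hypothetical mirror of the `succProbReward` method, written as a plain function so the snippet runs on its own. It checks that the outcome probabilities of each action sum to 1.

```python
def succ_prob_reward(state, action, walk_cost=1, tram_cost=1, fail_prob=0.5):
    """Hypothetical standalone mirror of TransportationMDP.succProbReward."""
    if action == 'walk':
        return [(state + 1, 1, -walk_cost)]
    if action == 'tram':
        return [(state, fail_prob, -tram_cost),
                (2 * state, 1 - fail_prob, -tram_cost)]
    return []

for action in ('walk', 'tram'):
    outcomes = succ_prob_reward(5, action)
    total = sum(prob for _, prob, _ in outcomes)
    print(action, outcomes, 'total probability:', total)
```

From state 5, walking yields a single outcome (state 6), while the tram yields two outcomes (stay in 5, or move to 10), each with reward -1.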

**Value iteration algorithm**

Value iteration is an iterative dynamic programming algorithm. Here it computes the optimal value function `V` and policy `pi` for reaching state `N` as cheaply as possible.

1. Initialize: set `V[state] = 0` for all states.
2. Repeat until converged. For each state:
   - If the state is the goal, set its value to 0.
   - Otherwise, compute for every available action the expected value under the transition model and the value function from the previous sweep.
   - Keep the maximum expected value as the new value, and the action that attains it as the greedy policy.
3. Convergence: stop when no state's value changes by `1e-10` or more between sweeps.
4. Return the policy and the value function.
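For Task 4, one possible way to track convergence is to count sweeps inside the loop. The sketch below is an assumption-laden variant, not the provided `valueIteration`: the name `value_iteration_counted`, its callable-based signature, and the module-level constants are hypothetical, and the problem is re-declared with plain functions so the snippet runs on its own. Unlike the original, it returns the values from the final sweep.

```python
def value_iteration_counted(states, actions, succ_prob_reward, is_end,
                            discount=1.0, tol=1e-10):
    """Synchronous value iteration that also returns the number of sweeps."""
    V = {s: 0.0 for s in states}
    pi = {}
    iterations = 0
    while True:
        iterations += 1
        new_V = {}
        for s in states:
            if is_end(s):
                new_V[s], pi[s] = 0.0, None
                continue
            # Best (expected value, action) pair under the previous sweep's V.
            new_V[s], pi[s] = max(
                (sum(p * (r + discount * V[s2])
                     for s2, p, r in succ_prob_reward(s, a)), a)
                for a in actions(s)
            )
        delta = max(abs(V[s] - new_V[s]) for s in states)
        V = new_V
        if delta < tol:
            return pi, V, iterations

# Hypothetical re-declaration of the N = 27 problem with the default costs.
N, WALK_COST, TRAM_COST, FAIL_PROB = 27, 1, 1, 0.5

def actions(s):
    # Only called for non-end states, so walking is always possible.
    return ['walk'] + (['tram'] if 2 * s <= N else [])

def succ_prob_reward(s, a):
    if a == 'walk':
        return [(s + 1, 1.0, -WALK_COST)]
    return [(s, FAIL_PROB, -TRAM_COST), (2 * s, 1 - FAIL_PROB, -TRAM_COST)]

pi, V, iterations = value_iteration_counted(
    range(1, N + 1), actions, succ_prob_reward, lambda s: s == N)
print('sweeps until convergence:', iterations)
```

Counting sweeps rather than individual state updates keeps the number comparable across different values of `N`.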



Finally, the script instantiates the MDP for `N = 27` and runs value iteration. The output is the optimal action at each state (the policy) and the value of each state, which is the negative of the minimum expected cost to reach `N` from that state.
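For Task 3, a minimal sketch of `printPolicyAndValues` might look like the following. The signature (taking the `pi` and `V` dictionaries) and the helper `formatPolicyAndValues` are assumptions, since the assignment only names the function. The sample dictionaries are copied by hand from states 24 to 27 of the output above.

```python
def formatPolicyAndValues(pi, V):
    """Return table rows (as strings) of state, value, and action."""
    rows = [f"{'state':>5}  {'value':>10}  action"]
    for state in sorted(V):
        action = pi.get(state) or '-'  # the end state has no action
        rows.append(f"{state:>5}  {V[state]:>10.4f}  {action}")
    return rows

def printPolicyAndValues(pi, V):
    """Print a table of state, value, and action."""
    for row in formatPolicyAndValues(pi, V):
        print(row)

# Hand-copied sample from the run above (states 24 to 27).
sample_pi = {24: 'walk', 25: 'walk', 26: 'walk', 27: None}
sample_V = {24: -3.0, 25: -2.0, 26: -1.0, 27: 0}
printPolicyAndValues(sample_pi, sample_V)
```

In the notebook, the same function could be called as `printPolicyAndValues(*valueIteration(mdp))`, since `valueIteration` returns the pair `(pi, V)`.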