# Lab Assignment 1: Understanding Markov Decision Processes and Value Iteration

## Objective
In this lab, you will explore the fundamentals of Markov Decision Processes (MDPs) using a transportation problem. You will implement and analyze the value iteration algorithm to find the optimal policy for reaching a destination.

## Background
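The `TransportationMDP` class models an agent travelling from location 1 to location \(N\) (set to 27). In each state the agent can either "walk", which moves it to \(state + 1\), or take the "tram", which moves it to \(2 \times state\) with success probability 0.5 (that is, `1 - failProb`) and leaves it in the same state with failure probability 0.5 (`failProb`). An action is available only if its destination does not exceed \(N\). Walking costs 1 unit (`walkCost`), and each tram attempt also costs 1 unit (`tramCost`), whether or not it succeeds. The code encodes costs as negative rewards, so maximizing the total reward is the same as minimizing the total cost to reach \(N\).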
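
The transition rules above can be written as Bellman optimality equations. The following is a sketch in notation introduced here for illustration only: \(c_w\), \(c_t\), \(p\), and \(\gamma\) stand for `walkCost`, `tramCost`, `failProb`, and the value returned by `discount()`, and rewards are negative costs as in the code.

```latex
\begin{aligned}
V(N) &= 0, \\
Q(s, \text{walk}) &= -c_w + \gamma\, V(s+1), \\
Q(s, \text{tram}) &= -c_t + \gamma \left[ p\, V(s) + (1-p)\, V(2s) \right], \\
V(s) &= \max_{a \in \text{actions}(s)} Q(s, a) \quad \text{for } s < N.
\end{aligned}
```

With \(\gamma = 1\) and \(p < 1\), if the tram is the maximizing action in state \(s\), solving the tram equation for \(V(s)\) gives \(V(s) = -c_t / (1 - p) + V(2s)\). In expectation, each successful ride therefore costs \(c_t / (1 - p)\), which is 2 units for \(p = 0.5\).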

## Tasks
1. **Run the Provided Code**
   - Execute the code with \(N = 27\). Observe the output of the `valueIteration` function.
   - Note: If your copy of the code calls `os.system('clear')` to clear the screen between iterations and the call doesn’t work on your system, comment it out.

2. **Modify Parameters**
   - Change `failProb` to 0.2 and rerun. Record the policy for states 1, 5, 10, and 20.
   - Change `walkCost` to 2 and rerun with `failProb = 0.5`. Record the policy for the same states.

3. **Implement a Utility Function**
   - Add `printPolicyAndValues` to print a table of state, value, and action.

4. **Analyze Convergence**
   - Modify `valueIteration` to count the number of iterations it performs before convergence, and report that count.
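
For Task 3, one possible shape for `printPolicyAndValues` is sketched below. The signature (two dictionaries, state to value and state to action), the column layout, and the returned string are assumptions made for illustration. The provided `valueIteration` prints its results but does not return `V` or `pi`, so it would first need to be changed to return them. The example values are hand-written, not output from the lab code.

```python
def printPolicyAndValues(V, pi):
    """Print one row per state: state, value, and chosen action.

    V maps state -> value and pi maps state -> action (None at the end state).
    The formatted table is also returned as a string for convenience.
    """
    lines = [f"{'state':>5}  {'value':>8}  action"]
    for state in sorted(V):
        action = pi.get(state)
        # Show '-' where no action is defined (for example, the end state).
        shown = action if action is not None else '-'
        lines.append(f"{state:>5}  {V[state]:>8.3f}  {shown}")
    table = "\n".join(lines)
    print(table)
    return table


# Hypothetical values for illustration only.
V_example = {1: -2.0, 2: -1.0, 3: 0.0}
pi_example = {1: 'walk', 2: 'walk', 3: None}
table = printPolicyAndValues(V_example, pi_example)
```

Returning the string in addition to printing it is a design choice that makes the table easy to paste into the report or to check in a test.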

## Questions
1. What does the optimal policy suggest about the trade-off between walking and taking the tram when `failProb = 0.5`? How does this change when `failProb = 0.2`?
2. Why does increasing `walkCost` to 2 affect the policy in certain states?
3. How does the discount factor (currently 1.0) influence the solution?
4. Is the policy always deterministic in this MDP? Why or why not?
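
One way to build intuition for Question 3 and for the iteration count in Task 4 is to experiment on a smaller problem first. The sketch below is a simplified stand-in, not the lab's MDP: the only action is "walk", and the function name `walk_only_values` and its parameters are hypothetical. It runs value iteration on the chain, counts sweeps, and lets you compare discount factors.

```python
def walk_only_values(n, discount, walk_cost=1.0, tol=1e-10):
    """Value iteration on a chain 1..n where the only action is 'walk'.

    Returns (V, iterations), where iterations counts every sweep,
    including the final one that detects convergence.
    """
    V = {s: 0.0 for s in range(1, n + 1)}
    iterations = 0
    while True:
        iterations += 1
        newV = {}
        for s in range(1, n + 1):
            if s == n:
                newV[s] = 0.0  # end state: nothing left to pay
            else:
                newV[s] = -walk_cost + discount * V[s + 1]
        # Largest change in any state's value during this sweep.
        delta = max(abs(V[s] - newV[s]) for s in V)
        V = newV
        if delta < tol:
            return V, iterations


V_undiscounted, iters = walk_only_values(n=5, discount=1.0)
V_discounted, _ = walk_only_values(n=5, discount=0.5)
print(V_undiscounted[1], V_discounted[1], iters)
```

With a discount below 1, costs paid further in the future weigh less, so the value of the start state is closer to zero than the plain sum of step costs.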

## Deliverables
- Submit your modified code.
- Provide a short report (1-2 pages) answering the questions.

import os

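class TransportationMDP:
    # Costs and the tram failure probability are class attributes, so they
    # can be changed in one place (e.g. TransportationMDP.failProb = 0.2).
    walkCost = 1
    tramCost = 1
    failProb = 0.5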

    def __init__(self, N):
        self.N = N

    def startState(self):
        return 1

    def isEnd(self, state):
        return state == self.N

    def actions(self, state):
        results = []
        if state + 1 <= self.N:
            results.append('walk')
        if 2 * state <= self.N:
            results.append('tram')
        return results

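    def succProbReward(self, state, action):
        # Return a list of (newState, prob, reward) triples.
        # Costs are encoded as negative rewards.
        results = []
        if action == 'walk':
            results.append((state + 1, 1, -self.walkCost))
        elif action == 'tram':
            # The tram fails with probability failProb (the agent stays put)
            # and otherwise doubles the state; the cost is paid either way.
            results.append((state, self.failProb, -self.tramCost))
            results.append((2 * state, 1 - self.failProb, -self.tramCost))
        return results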

    def discount(self):
        return 1.0

    def states(self):
        return list(range(1, self.N + 1))

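def valueIteration(mdp):
    V = {}   # state -> current value estimate
    pi = {}  # state -> greedy action under the current estimate

    def Q(state, action):
        # Expected immediate reward plus discounted value of the successor.
        return sum(prob * (reward + mdp.discount() * V.get(newState, 0))
                   for newState, prob, reward in mdp.succProbReward(state, action))

    # Start from a value of 0 in every state.
    for state in mdp.states():
        V[state] = 0

    while True:
        # One synchronous sweep: newV is computed entirely from the previous V.
        newV = {}
        for state in mdp.states():
            if mdp.isEnd(state):
                newV[state], pi[state] = 0, None
            else:
                # Ties in Q are broken by comparing action names ('walk' > 'tram').
                newV[state], pi[state] = max((Q(state, action), action) for action in mdp.actions(state))
            print(f'{state}: {newV[state]} {pi[state]}')

        # Stop once the largest change in any state's value is below the tolerance.
        if max(abs(V[state] - newV[state]) for state in mdp.states()) < 1e-10:
            break
        V = newV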

mdp = TransportationMDP(N=27)
valueIteration(mdp)

1: -1.0 walk
2: -1.0 walk
3: -1.0 walk
4: -1.0 walk
5: -1.0 walk
6: -1.0 walk
7: -1.0 walk
8: -1.0 walk
9: -1.0 walk
10: -1.0 walk
11: -1.0 walk
12: -1.0 walk
13: -1.0 walk
14: -1.0 walk
15: -1.0 walk
16: -1.0 walk
17: -1.0 walk
18: -1.0 walk
19: -1.0 walk
20: -1.0 walk
21: -1.0 walk
22: -1.0 walk
23: -1.0 walk
24: -1.0 walk
25: -1.0 walk
26: -1.0 walk
27: 0 None
1: -2.0 walk
2: -2.0 walk
3: -2.0 walk
4: -2.0 walk
5: -2.0 walk
6: -2.0 walk
7: -2.0 walk
8: -2.0 walk
9: -2.0 walk
10: -2.0 walk
11: -2.0 walk
12: -2.0 walk
13: -2.0 walk
14: -2.0 walk
15: -2.0 walk
16: -2.0 walk
17: -2.0 walk
18: -2.0 walk
19: -2.0 walk
20: -2.0 walk
21: -2.0 walk
22: -2.0 walk
23: -2.0 walk
24: -2.0 walk
25: -2.0 walk
26: -1.0 walk
27: 0 None
1: -3.0 walk
2: -3.0 walk
3: -3.0 walk
4: -3.0 walk
5: -3.0 walk
6: -3.0 walk
7: -3.0 walk
8: -3.0 walk
9: -3.0 walk
10: -3.0 walk
11: -3.0 walk
12: -3.0 walk
13: -2.5 tram
14: -3.0 walk
15: -3.0 walk
16: -3.0 walk
17: -3.0 walk
18: -3.0 walk
19: -3.0 walk
20: -3.0 wa