# Reinforcement Learning with Monte Carlo, SARSA, and Exploration-Exploitation

## Objective
In this lab, you will extend the `TransportationMDP` problem to a reinforcement learning (RL) setting. You will implement the Monte Carlo and SARSA algorithms, compare their performance, and examine the exploration-exploitation trade-off.

## Background
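Unlike value iteration, which needs the full transition model, RL methods learn from sampled experience. You’ll simulate an agent moving through the transportation problem without access to `failProb` or the exact outcome probabilities. Both methods in this lab are model-free: Monte Carlo updates its estimates from complete episode returns, while SARSA applies a temporal-difference update after every step.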
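
For reference, the two update rules are commonly written as below. The notation is an assumption rather than something this lab fixes: \(G_t\) is the return observed from step \(t\) of an episode that ends at step \(T\), \(N(s, a)\) counts how often the pair \((s, a)\) has been updated, \(\gamma\) is the value of `mdp.discount()` (1.0 here), and \(\alpha\) is the `alpha` argument of `sarsa`. The Monte Carlo form shown is the sample-average variant; a constant step size would also work.

Monte Carlo (update after the episode ends):

\[
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_{k+1}, \qquad
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \frac{1}{N(s_t, a_t)} \bigl( G_t - Q(s_t, a_t) \bigr)
\]

SARSA (update after every step, using the next action \(a_{t+1}\) actually chosen by the behavior policy):

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl( r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \bigr)
\]

In both cases \(Q\) is taken to be 0 at the end state, since no further reward can be collected there.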

## Tasks
1. **Simulate Episodes**
   - Write `simulateEpisode(mdp, policy, max_steps=100)` to return a list of (state, action, reward) tuples.

2. **Implement Monte Carlo**
   - Create `monteCarlo(mdp, num_episodes=1000, epsilon=0.1)` using an \(\epsilon\)-greedy strategy. Estimate \(Q(s, a)\) from the episode returns and derive the greedy policy from it.

3. **Implement SARSA**
   - Write `sarsa(mdp, num_episodes=1000, alpha=0.1, epsilon=0.1)` to update \(Q(s, a)\) incrementally.

4. **Exploration vs. Exploitation**
   - Run Monte Carlo and SARSA with \(\epsilon = 0.01\) and \(\epsilon = 0.5\). For each run, record the average cumulative reward per episode over the last 100 episodes.
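
Tasks 1 to 3 share two small building blocks: drawing one sampled outcome from the environment, and picking an \(\epsilon\)-greedy action. The sketch below shows one possible shape for them. The helper names `sampleTransition` and `epsilonGreedy`, the `rng` parameter, and the choice to store \(Q\) as a dict keyed by `(state, action)` with unseen pairs defaulting to 0.0 are all assumptions made for illustration, not requirements of the lab.

```python
import random

def sampleTransition(mdp, state, action, rng=random):
    """Draw one (next_state, reward) pair for (state, action).

    The learner only sees the sampled outcome; the probabilities returned by
    succProbReward are used here solely to act as the simulator.
    """
    outcomes = mdp.succProbReward(state, action)  # [(next_state, prob, reward), ...]
    weights = [prob for _, prob, _ in outcomes]
    next_state, _, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

def epsilonGreedy(Q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one.

    Q is assumed to be a dict keyed by (state, action); missing entries count
    as 0.0. Ties between equally good actions are broken at random.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(Q.get((state, a), 0.0) for a in actions)
    return rng.choice([a for a in actions if Q.get((state, a), 0.0) == best])
```

Passing a seeded `random.Random` instance as `rng` makes runs repeatable, which helps when comparing the two values of \(\epsilon\) in Task 4.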

## Questions
1. How does the Monte Carlo policy compare to value iteration?
2. What are the key differences between Monte Carlo and SARSA?
3. How does changing \(\epsilon\) affect the policy?
4. Which method would you choose for a real transportation system, and why?

## Deliverables
- Submit your code.
- Provide a report (2-3 pages) comparing the learned policies and answering the questions above.

In [None]:
import random

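class TransportationMDP(object):
    """Travel from location 1 to location N by walking (s -> s + 1) or taking the tram (s -> 2s)."""
    walkCost = 1
    tramCost = 1
    failProb = 0.5  # probability that the tram fails and the agent stays where it is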

    def __init__(self, N):
        self.N = N

    def startState(self):
        return 1

    def isEnd(self, state):
        return state == self.N

    def actions(self, state):
        results = []
        if state + 1 <= self.N:
            results.append('walk')
        if 2 * state <= self.N:
            results.append('tram')
        return results

    def succProbReward(self, state, action):
        # Returns a list of (nextState, probability, reward) triples.
        results = []
        if action == 'walk':
            results.append((state + 1, 1, -self.walkCost))
        elif action == 'tram':
            results.append((state, self.failProb, -self.tramCost))
            results.append((2 * state, 1 - self.failProb, -self.tramCost))
        return results

    def discount(self):
        return 1.0

    def states(self):
        return list(range(1, self.N + 1))

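# Placeholders for student implementations
def simulateEpisode(mdp, policy, max_steps=100):
    # Should return a list of (state, action, reward) tuples,
    # stopping at the end state or after max_steps steps.
    pass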

def monteCarlo(mdp, num_episodes=1000, epsilon=0.1):
    pass

def sarsa(mdp, num_episodes=1000, alpha=0.1, epsilon=0.1):
    pass

mdp = TransportationMDP(N=27)
# Test your implementations here