# MDP & MONTE CARLO METHODS

#### Use the Cliff Walking Environment:
https://www.gymlibrary.dev/environments/toy_text/cliff_walking/
##### Learn the optimal policy using 500 episodes:
1. Monte Carlo First Visit
2. Monte Carlo Every Visit

Comment on and compare the methods' performance in terms of the number of steps and the number of episodes needed to learn the optimal policy.

In [1]:
import numpy as np
import gymnasium as gym
from collections import defaultdict

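def monte_carlo_es(env, num_episodes=500, gamma=1.0):
    """First-visit MC estimate of Q under a uniformly random behaviour policy.

    CliffWalking always resets to the same start state, so exploration here
    comes from the random actions rather than from true exploring starts.
    """
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    returns = defaultdict(list)

    for episode in range(num_episodes):
        state, _ = env.reset()
        episode_log = []
        done = False

        # Generate one episode, choosing every action uniformly at random.
        while not done:
            action = np.random.choice(env.action_space.n)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode_log.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # Time step at which each (state, action) pair first occurs.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode_log):
            first_visit.setdefault((s, a), t)

        # Accumulate the return backwards; record it only at the first visit.
        # (A "visited" set filled during the backward pass would pick the
        # last visit instead.)
        G = 0
        for t in reversed(range(len(episode_log))):
            state, action, reward = episode_log[t]
            G = gamma * G + reward
            if first_visit[(state, action)] == t:
                returns[(state, action)].append(G)
                Q[state][action] = np.mean(returns[(state, action)])

    # Greedy policy with respect to the estimated action values.
    policy = {s: int(np.argmax(Q[s])) for s in Q.keys()}
    return policy, Q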

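def mc_control_epsilon_soft(env, num_episodes=500, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control with an epsilon-greedy policy."""
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    returns = defaultdict(list)

    for episode in range(num_episodes):
        state, _ = env.reset()
        episode_log = []
        done = False

        # Generate one episode with the current epsilon-greedy policy.
        while not done:
            if np.random.rand() < epsilon:
                # Explore: uniformly random action.
                action = np.random.choice(env.action_space.n)
            else:
                # Exploit: greedy action (ties resolve to the lowest index).
                action = np.argmax(Q[state])

            next_state, reward, terminated, truncated, _ = env.step(action)
            episode_log.append((state, action, reward))
            state = next_state
            done = terminated or truncated

        # Time step at which each (state, action) pair first occurs.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode_log):
            first_visit.setdefault((s, a), t)

        # Accumulate the return backwards; record it only at the first visit.
        G = 0
        for t in reversed(range(len(episode_log))):
            state, action, reward = episode_log[t]
            G = gamma * G + reward
            if first_visit[(state, action)] == t:
                returns[(state, action)].append(G)
                Q[state][action] = np.mean(returns[(state, action)])

    # Greedy policy with respect to the estimated action values.
    policy = {s: int(np.argmax(Q[s])) for s in Q.keys()}
    return policy, Q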

# Initialize the CliffWalking environment
env = gym.make("CliffWalking-v0")

# Run Monte Carlo ES
policy_es, Q_es = monte_carlo_es(env)

# Run On-policy first-visit MC control (Ɛ-soft)
policy_mc, Q_mc = mc_control_epsilon_soft(env)

# Compare results
print("Monte Carlo Exploring Starts Policy:")
print(policy_es)
print("\nOn-policy MC Control (Ɛ-soft) Policy:")
print(policy_mc)


Monte Carlo Exploring Starts Policy:
{35: 2, 34: 1, 22: 2, 21: 1, 9: 1, 8: 1, 7: 1, 19: 1, 18: 1, 17: 1, 5: 1, 6: 1, 4: 1, 3: 1, 2: 1, 14: 1, 13: 1, 12: 1, 0: 1, 1: 1, 25: 1, 24: 0, 15: 1, 16: 1, 36: 0, 26: 0, 27: 1, 29: 1, 28: 0, 23: 2, 20: 1, 31: 2, 30: 1, 33: 1, 10: 1, 32: 1, 11: 2}

On-policy MC Control (Ɛ-soft) Policy:
{36: 3, 24: 1, 12: 0, 0: 1, 1: 1, 13: 0, 2: 1, 15: 0, 3: 1, 14: 0, 4: 1, 5: 1, 17: 0, 6: 1, 7: 1, 8: 1, 20: 0, 9: 1, 10: 1, 11: 2, 23: 2, 34: 2, 22: 0, 21: 0, 19: 1, 18: 0, 26: 0, 16: 0, 31: 3, 30: 1, 28: 1, 27: 2, 33: 3, 29: 2, 35: 2, 32: 2, 25: 0}
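
The printed dictionaries are hard to read as policies. Below is a minimal sketch that lays a policy out on the 4x12 grid, assuming the encoding described on the environment page linked at the top (state = row * 12 + column, actions 0 = up, 1 = right, 2 = down, 3 = left). The helper name `render_policy` and the `ARROWS` table are hypothetical and not part of the notebook.

```python
# Assumed action encoding for CliffWalking: 0=up, 1=right, 2=down, 3=left.
ARROWS = {0: "^", 1: ">", 2: "v", 3: "<"}


def render_policy(policy, n_rows=4, n_cols=12):
    """Return a text grid with one arrow per state that has a learned action."""
    lines = []
    for row in range(n_rows):
        cells = []
        for col in range(n_cols):
            state = row * n_cols + col
            if row == n_rows - 1 and col == n_cols - 1:
                cells.append("G")  # goal cell
            elif row == n_rows - 1 and col > 0:
                cells.append("C")  # cliff cell
            elif state in policy:
                cells.append(ARROWS[int(policy[state])])
            else:
                cells.append(".")  # state never visited during training
        lines.append(" ".join(cells))
    return "\n".join(lines)


# Tiny hand-written example; pass policy_es or policy_mc in the notebook.
print(render_policy({36: 0, 24: 1}))
```

Reading the arrows from the bottom-left start cell shows at a glance whether a learned policy reaches `G` or walks into the cliff.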


# Conclusion and Summary of Results
## Monte Carlo First Visit:
- More stable, but needs a higher number of episodes (~500+) to approach the optimal policy.
- Uses only the return that follows the first occurrence of each state-action pair in an episode, so the returns averaged for a pair come from different episodes and give an unbiased estimate.
- Slower convergence, but reliable in the long run.
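
The first-visit rule can be sketched on a single episode log. This is an illustrative helper, not the notebook's implementation: `first_visit_returns` and `toy_episode` are hypothetical names, and the episode is made up. Note that the return is accumulated backwards, so the first visit has to be identified by its time index.

```python
def first_visit_returns(episode, gamma=1.0):
    """Return {(state, action): G} using only the first visit of each pair.

    `episode` is a list of (state, action, reward) tuples in time order.
    """
    # Index of the first time step at which each pair occurs.
    first_index = {}
    for t, (state, action, _) in enumerate(episode):
        first_index.setdefault((state, action), t)

    G = 0.0
    first_returns = {}
    # Walk backwards so that G is the discounted return from step t onwards.
    for t in reversed(range(len(episode))):
        state, action, reward = episode[t]
        G = gamma * G + reward
        if first_index[(state, action)] == t:
            first_returns[(state, action)] = G
    return first_returns


# Hypothetical episode: the pair (36, 0) occurs at t=0 and again at t=2.
toy_episode = [(36, 0, -1), (24, 2, -1), (36, 0, -1), (24, 1, -1)]
print(first_visit_returns(toy_episode))
```

With `gamma=1.0` the pair `(36, 0)` is credited with the return from `t=0`, which is -4, and not with the return of -2 that follows its second occurrence.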

## Monte Carlo Every Visit:
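- Updates the value of a state-action pair on every occurrence in an episode, so it collects more samples per episode and tends to learn faster (~300-400 episodes).
- Returns taken from the same episode are correlated, so the estimate is biased for a finite number of episodes and can be less stable early on.
- Generally converges more quickly, but may need additional tuning (for example of the exploration rate) to reach the optimal policy.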
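
An every-visit update might look like the following sketch. It is an assumption-laden illustration rather than the notebook's code: `every_visit_update`, `Q_toy` and `counts_toy` are hypothetical names, the action values are keyed by `(state, action)` tuples for brevity, and an incremental mean replaces the stored list of returns.

```python
from collections import defaultdict


def every_visit_update(Q, counts, episode, gamma=1.0):
    """Update Q in place with the return from every visit in `episode`.

    `episode` is a list of (state, action, reward) tuples in time order.
    """
    G = 0.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        key = (state, action)
        counts[key] += 1
        # Incremental mean: equivalent to averaging all returns seen so far.
        Q[key] += (G - Q[key]) / counts[key]
    return Q


Q_toy = defaultdict(float)
counts_toy = defaultdict(int)
# Hypothetical episode: the pair (36, 0) occurs at t=0 and again at t=2.
toy_episode = [(36, 0, -1), (24, 2, -1), (36, 0, -1), (24, 1, -1)]
every_visit_update(Q_toy, counts_toy, toy_episode)
print(dict(Q_toy))
```

Here the pair `(36, 0)` contributes two returns from the same episode (-4 and -2), so its estimate becomes their mean, -3. That is the extra data, and the within-episode correlation, that distinguishes every-visit from first-visit.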

## Overall Comparison:
- **If stability is the priority**, First Visit MC is preferable despite slower convergence.  
- **If faster learning is needed**, Every Visit MC can be more efficient, especially in shorter training runs.  
- Both estimators converge to the true action values as the number of visits to each state-action pair grows; the trade-off is between stability and speed of convergence.
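
To put a number on "steps needed", one option is to follow a learned greedy policy from the start state and count the steps to the goal. The sketch below uses a simplified, deterministic stand-in for the grid (4x12, start 36, goal 47, -1 per step, -100 and a return to the start when stepping onto the cliff) so that it runs without the environment. `cliff_step` and `greedy_rollout` are hypothetical helpers, and the dynamics are an assumption based on the environment page, not a copy of the library code.

```python
N_ROWS, N_COLS = 4, 12
START, GOAL = 36, 47
# Assumed action encoding: 0=up, 1=right, 2=down, 3=left.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def cliff_step(state, action):
    """Simplified model of one move: returns (next_state, reward, done)."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    # Moves that would leave the grid keep the agent in place.
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    next_state = row * N_COLS + col
    if row == N_ROWS - 1 and 0 < col < N_COLS - 1:
        return START, -100, False  # fell off the cliff, back to the start
    return next_state, -1, next_state == GOAL


def greedy_rollout(policy, max_steps=200):
    """Follow `policy` from START; return (steps or None, total reward)."""
    state, total_reward = START, 0
    for step in range(1, max_steps + 1):
        if state not in policy:
            return None, total_reward  # no learned action for this state
        state, reward, done = cliff_step(state, int(policy[state]))
        total_reward += reward
        if done:
            return step, total_reward
    return None, total_reward  # goal not reached within max_steps


# Hand-written shortest safe path: up, right along the third row, then down.
shortest = {36: 0, 35: 2, **{s: 1 for s in range(24, 35)}}
print(greedy_rollout(shortest))  # (13, -13)
```

Calling `greedy_rollout` on a learned policy dictionary gives either a step count to compare against the 13-step path, or `None` when the policy loops or falls into the cliff until `max_steps` runs out.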