### **Q4 — Off-Policy Monte Carlo**

#### **Problem Statement**

In this problem, we use the same gridworld environment from Problem 3 and estimate the value function using **off-policy Monte Carlo methods with Importance Sampling**.

- The **behavior policy**, b(a|s), is a fixed random policy.
- The **target policy**, π(a|s), is a greedy policy with respect to the learned action-value function Q(s,a).
- The discount factor is γ = 0.9.

Two off-policy Monte Carlo estimators are implemented:

1. Ordinary Importance Sampling (OIS)
2. Weighted Importance Sampling (WIS)
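
In standard textbook notation (the symbols below are illustrative and not taken from the Q4 code), let $G_i$ be the $i$-th return observed after a visit to $(s,a)$ at time $t$ in an episode that ends at time $T$, and let $W_i$ be its importance weight:

$$
W_i = \prod_{k=t+1}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
$$

With $n$ such returns, the two estimators differ only in the denominator:

$$
Q_{\text{OIS}}(s,a) = \frac{1}{n}\sum_{i=1}^{n} W_i G_i,
\qquad
Q_{\text{WIS}}(s,a) = \frac{\sum_{i=1}^{n} W_i G_i}{\sum_{i=1}^{n} W_i}
$$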

For reference, we also compute the optimal value function using **Value Iteration** and compare the results.

#### **Gridworld Environment**

We use a **5 × 5 gridworld**, identical to the environment used in **Problem 3**.

Environment details:

- Grid size: 5 rows × 5 columns
- Terminal (goal) state: bottom-right cell
- Blocked (unreachable) states: same positions as in Problem 3
- Available actions: Up, Down, Left, Right
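- Reward: −1 for each step (the baseline values printed later, such as 10.00 next to the goal, are consistent with an additional reward of about +10 on arriving at the goal)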
- An episode terminates when the agent reaches the goal state

#### **Policies**

**Behavior Policy**

The behavior policy b(a|s) selects actions uniformly at random from the action set. This policy is fixed and is used only to generate episodes.

**Target Policy**

The target policy π(a|s) is greedy with respect to the current estimate of Q(s,a). If multiple actions share the same maximum value, one is selected randomly.
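
The two policies can be sketched as follows. This is a minimal illustration, not the code in `off_policy_mc.py`; the function names and the action ordering (Up, Down, Left, Right) are assumptions made for the example.

```python
import numpy as np

N_ACTIONS = 4  # assumed order: Up, Down, Left, Right


def behavior_probs() -> np.ndarray:
    """Uniform random behavior policy b(a|s); identical for every state."""
    return np.full(N_ACTIONS, 1.0 / N_ACTIONS)


def greedy_action(q_row: np.ndarray, rng: np.random.Generator) -> int:
    """Greedy action for one state, breaking ties uniformly at random."""
    best = np.flatnonzero(q_row == q_row.max())  # indices of all maximal actions
    return int(rng.choice(best))


rng = np.random.default_rng(0)
q_row = np.array([0.0, 1.5, 1.5, -2.0])  # actions 1 and 2 are tied
a = greedy_action(q_row, rng)            # returns either 1 or 2
```

Because the behavior policy gives every action probability 0.25, it covers any greedy target policy, which is the requirement for importance sampling to be well defined.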

#### **Episode Generation**

Episodes are generated by starting from a random non-terminal state and following the behavior policy b(a|s).

Each episode is a sequence of (state, action, reward) tuples that continues until the goal state is reached or a maximum number of steps is exceeded.
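
A rough sketch of episode generation under the uniform behavior policy is shown below. It is an assumption-laden stand-in for the real environment: `step` and `generate_episode` are hypothetical names, blocked cells are omitted, walls leave the agent in place, and every step returns −1.

```python
import numpy as np

ROWS, COLS = 5, 5
GOAL = (4, 4)                                # bottom-right cell
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # Up, Down, Left, Right


def step(state, action):
    """Deterministic move; bumping into a wall leaves the agent in place (assumption)."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS):
        nr, nc = r, c
    return (nr, nc), -1.0                    # simplified: -1 on every step


def generate_episode(rng, max_steps=200):
    """Start in a random non-terminal cell and follow the uniform behavior policy."""
    starts = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) != GOAL]
    state = starts[rng.integers(len(starts))]
    episode = []
    for _ in range(max_steps):
        action = int(rng.integers(len(MOVES)))   # b(a|s) = 1/4 for every action
        next_state, reward = step(state, action)
        episode.append((state, action, reward))
        state = next_state
        if state == GOAL:
            break
    return episode


episode = generate_episode(np.random.default_rng(0), max_steps=50)
```

The step cap matters because a purely random walk can take a long time to reach the goal.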

#### **Value Matrix and Greedy Policy Display**

At selected episode indices k, we record:

- The estimated state-value function V_k, displayed as a **5 × 5 matrix**
- The greedy policy derived from Q_k, shown using directional arrows

This presentation matches the value tables and policy diagrams shown in the lecture slides for the 5 × 5 gridworld.
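
One way to obtain both displays from a Q-table is sketched below, assuming Q is stored as an array of shape (rows, cols, actions) and that V_k(s) is taken as the maximum of Q_k(s, a) over actions. The helper name and the arrow ordering are hypothetical.

```python
import numpy as np

ARROWS = np.array(["↑", "↓", "←", "→"])  # assumed action order: Up, Down, Left, Right


def value_and_policy(Q: np.ndarray):
    """Return the greedy state values and an arrow grid for a Q-table."""
    V = Q.max(axis=-1)        # V(s) = max_a Q(s, a)
    pi = Q.argmax(axis=-1)    # first maximal action; ties are not randomized here
    return V, ARROWS[pi]


Q = np.zeros((5, 5, 4))
Q[0, 0] = [0.0, 2.0, -1.0, 1.0]   # "Down" is best in the top-left cell
V, arrows = value_and_policy(Q)
```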

#### **Logger**

During training, snapshots of the current estimates are written to log files under:

`./logs/q4/`

Each snapshot includes:
- The episode index k
- The current value matrix V_k
- The greedy policy corresponding to Q_k

#### **Initialization of Q and Weight Tracking**

We maintain the following data structures:

- Q(s,a): action-value estimates
- C(s,a): cumulative weights for Weighted Importance Sampling
- N(s,a): counters used for Ordinary Importance Sampling

All values are initialized to zero.

#### **Off-policy Monte Carlo Control (Weighted Importance Sampling)**

For each episode generated under the behavior policy, returns are computed backward from the end of the episode.

Weighted Importance Sampling is used to update Q(s,a), which reduces variance by normalizing the importance weights using cumulative sums.

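If the behavior action at a state does not match the greedy target action, then π(a|s) = 0 at that step, so the importance weight of every earlier step in the episode is zero and the backward update for that episode stops early.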
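
The backward pass described above can be sketched as follows. This follows the standard textbook form of weighted-IS Monte Carlo control and is not necessarily identical to `OffPolicyMCControl`; the function name, the integer state indexing, and the deterministic `argmax` tie-breaking are assumptions for the example.

```python
import numpy as np


def wis_update(Q, C, episode, gamma=0.9, b_prob=0.25):
    """One backward pass of weighted-IS off-policy MC control.

    Q, C: float arrays of shape (n_states, n_actions), updated in place.
    episode: list of (state, action, reward) with integer state indices.
    b_prob: probability the behavior policy assigns to each action.
    """
    G, W = 0.0, 1.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward                   # discounted return from this step
        C[state, action] += W                    # cumulative weight
        Q[state, action] += (W / C[state, action]) * (G - Q[state, action])
        if action != int(np.argmax(Q[state])):
            break                                # pi(action|state) = 0: stop early
        W *= 1.0 / b_prob                        # pi = 1 for the greedy action
    return Q, C


# Toy example with 3 states and 2 actions, so b(a|s) = 0.5
Q = np.zeros((3, 2))
C = np.zeros((3, 2))
episode = [(0, 1, -1.0), (1, 0, -1.0), (2, 1, 10.0)]
wis_update(Q, C, episode, gamma=0.9, b_prob=0.5)
```

On a first visit the weight cancels (W / C = 1), so the estimates equal the plain returns 10, 8 and 6.2 regardless of how large W has grown.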

#### **Off-policy Monte Carlo Control (Ordinary Importance Sampling)**

Ordinary Importance Sampling uses the same importance ratios but applies them directly when updating Q(s,a).

For a fixed target policy this estimator is unbiased, but it typically has much higher variance than Weighted Importance Sampling, which is biased for finite samples but consistent.

An incremental averaging form is used to match the pseudocode presented in the lecture notes.
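
One plausible reading of the incremental form is sketched below: each visit adds the weighted return W·G to a running average over N(s,a) visits. The function name and the state indexing are assumptions, and the lecture pseudocode may differ in detail.

```python
import numpy as np


def ois_update(Q, N, episode, gamma=0.9, b_prob=0.25):
    """One backward pass of ordinary-IS off-policy MC control.

    Q, N: float arrays of shape (n_states, n_actions), updated in place.
    episode: list of (state, action, reward) with integer state indices.
    """
    G, W = 0.0, 1.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        N[state, action] += 1
        # Running average of the weighted returns W * G
        Q[state, action] += (W * G - Q[state, action]) / N[state, action]
        if action != int(np.argmax(Q[state])):
            break                      # pi(action|state) = 0: stop early
        W *= 1.0 / b_prob              # raw, unnormalized importance weight
    return Q, N


# Toy example with 3 states and 2 actions, so b(a|s) = 0.5
Q = np.zeros((3, 2))
N = np.zeros((3, 2))
episode = [(0, 1, -1.0), (1, 0, -1.0), (2, 1, 10.0)]
ois_update(Q, N, episode, gamma=0.9, b_prob=0.5)
```

In this toy episode the returns are 10, 8 and 6.2, but the OIS estimates after one episode are 10, 16 and 24.8, because the raw weights 1, 2 and 4 multiply the returns directly.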

#### **Running Both Algorithms and Writing Logs**

Both Ordinary IS and Weighted IS are run for the same number of episodes.

Separate log files are created for each method so their convergence behavior and stability can be compared directly.

#### **Value Iteration Baseline**

Value Iteration is used to compute the optimal value function for the 5 × 5 gridworld.

The resulting value matrix and greedy policy serve as a baseline for evaluating the Monte Carlo estimates obtained using off-policy learning.
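
For orientation, a compact Value Iteration sketch is given below. It is not `GridworldMDP.value_iteration`: blocked cells are omitted, walls leave the agent in place, and the +10 reward on arriving at the goal is an assumption inferred from the printed baseline rather than from the problem statement.

```python
import numpy as np

ROWS, COLS, GAMMA = 5, 5, 0.9
GOAL = (4, 4)
STEP_REWARD, GOAL_REWARD = -1.0, 10.0        # GOAL_REWARD is an assumption
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # Up, Down, Left, Right


def value_iteration(theta=1e-8, max_iters=1000):
    """Synchronous sweeps of V(s) <- max_a [r + gamma * V(s')]."""
    V = np.zeros((ROWS, COLS))
    for it in range(1, max_iters + 1):
        V_new = np.zeros_like(V)             # the goal keeps value 0
        for r in range(ROWS):
            for c in range(COLS):
                if (r, c) == GOAL:
                    continue
                best = -np.inf
                for dr, dc in MOVES:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < ROWS and 0 <= nc < COLS):
                        nr, nc = r, c        # bumping into a wall: stay put
                    reward = GOAL_REWARD if (nr, nc) == GOAL else STEP_REWARD
                    best = max(best, reward + GAMMA * V[nr, nc])
                V_new[r, c] = best
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, it
        V = V_new
    return V, max_iters


V_star, n_iters = value_iteration()
```

Under these assumptions the cells next to the goal get value 10 and the top-left cell comes out at about −0.43, which is consistent with the baseline matrix printed later in this notebook.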

#### **Importing the Q4 Implementation**

In this notebook, we **reuse the Python implementation written for Question 4** (`off_policy_mc.py`) instead of re-implementing all algorithms inside the notebook.

This approach keeps the notebook **clean, readable, and consistent** with the actual code used to generate the results.

To allow the notebook to access the `src/` folder, we first add the **project root** to Python’s module search path. This makes it possible to directly import the Q4 Gridworld environment and Monte Carlo algorithms.

#### **What is being imported**

- **GridworldConfig**  
  Defines the grid size, discount factor (γ), rewards, goal state, and episode settings.

- **GridworldMDP**  
  Implements the same **5×5 Gridworld** used in Q3 and Q4, including state transitions and reward-on-arrival logic.

- **OffPolicyMCControl**  
  Contains the two off-policy Monte Carlo control methods used in Q4:
  - Ordinary Importance Sampling (OIS)
  - Weighted Importance Sampling (WIS)

- **run_q4_variations**  
  A helper function that runs:
  - Value Iteration (baseline)
  - Off-policy Monte Carlo with OIS
  - Off-policy Monte Carlo with WIS  

  and returns all results in a structured format for comparison and analysis.

By separating **implementation (Python file)** from **analysis (notebook)**, the notebook can focus on results, comparison, and interpretation rather than low-level algorithm details.

In [15]:
import sys
from pathlib import Path
import numpy as np

# Add project root to Python path (so src/ works)
PROJECT_ROOT = Path.cwd().parent  # adjust if notebook is deeper
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

from src.q4.off_policy_mc import (
    GridworldConfig,
    GridworldMDP,
    OffPolicyMCControl,
    run_q4_variations,
)


#### **Value Iteration Baseline**

Before running the off-policy Monte Carlo methods, we first compute a Value Iteration (VI) baseline for the same 5×5 Gridworld.

Value Iteration assumes that the full environment is known (state transitions and rewards). It repeatedly sweeps over all states and applies the Bellman optimality update, V(s) ← max over a of [r + γ V(s')], until the values stop changing. Because the gridworld is small and deterministic, this process converges very quickly.

This baseline serves two purposes:
* It provides the optimal value function (V*) for the gridworld
* It gives us a reference policy to compare against the Monte Carlo results

**What this cell outputs**
* Number of iterations needed for Value Iteration to converge
* Runtime, showing how fast model-based methods can be
* The optimal value matrix (V*), displayed in grid form
* The greedy optimal policy (π*), shown using arrow symbols

**Why this matters for Q4**

The off-policy Monte Carlo methods in Q4 do not know the environment model and must learn only from sampled episodes. By comparing their learned values and policies against this Value Iteration baseline, we can clearly see:
* how close Monte Carlo learning gets to the optimal solution
* the effect of variance in Ordinary vs. Weighted Importance Sampling

This makes Value Iteration an essential benchmark, even though Q4 itself focuses on Monte Carlo learning.

In [17]:
# cfg is the GridworldConfig instance defined earlier in the notebook
mdp = GridworldMDP(cfg)

V_vi, pi_vi, it_vi, t_vi = mdp.value_iteration()

print("=== Value Iteration Baseline ===")
print("iters:", it_vi, "time(s):", round(t_vi, 6))
print(mdp.format_V(V_vi, decimals=2))
print("\nPolicy:")
print(mdp.format_pi(pi_vi))

=== Value Iteration Baseline ===
iters: 9 time(s): 0.000413
 -0.43    0.63    1.81    3.12    4.58*
  0.63    1.81    3.12    4.58    6.20 
  1.81    3.12    4.58*   6.20    8.00 
  3.12*   4.58    6.20    8.00   10.00 
  4.58    6.20    8.00   10.00     G   

Policy:
 →   →   →   ↓   ↓ 
 →   →   →   →   ↓ 
 →   ↓   →   →   ↓ 
 →   →   →   →   ↓ 
 →   →   →   →   G 


#### **Running the Full Q4 Workflow**

The `run_q4_variations` function executes the entire Q4 workflow using the configuration defined earlier:
* Value Iteration (baseline)
* Off-policy Monte Carlo using Ordinary Importance Sampling (OIS)
* Off-policy Monte Carlo using Weighted Importance Sampling (WIS)

The function uses the same grid size, discount factor, episode count, and snapshot schedule defined in GridworldConfig, ensuring consistency across all methods.

**What this step does**
* Runs Value Iteration to compute the optimal value function and policy
* Runs Off-policy Monte Carlo (OIS) using randomly generated episodes
* Runs Off-policy Monte Carlo (WIS) using the same episode settings
* Writes detailed log files and CSV snapshots to logs/q4/
* Collects all results into a single output object for later analysis

**Output**

The returned object contains:
* Value function and policy from Value Iteration
* Value functions and policies from OIS and WIS
* Runtime information
* Paths to the generated log and CSV files

This allows the notebook to focus on comparison and interpretation, while the full computation is handled inside the Python implementation.

In [18]:
out = run_q4_variations(
    episodes=cfg.episodes,
    gamma=cfg.gamma,
    max_steps=cfg.max_steps,
    snapshot_episodes=cfg.snapshot_episodes,
    log_dir="logs/q4",
)

#### **Off-policy Monte Carlo Results (OIS vs WIS)**

In this section, we display the final results of the off-policy Monte Carlo control methods used in Question 4 and compare their learned value functions and policies.

Both methods learn from sampled episodes, rather than using full knowledge of the environment rules.

**Ordinary Importance Sampling (OIS)**

The table below shows the state-value function and greedy policy learned using Ordinary Importance Sampling (OIS).

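OIS applies raw importance weights when reusing data collected from a random behavior policy.
Each step on which the uniform behavior policy happens to agree with the greedy target policy multiplies the weight by π(a|s)/b(a|s) = 1/0.25 = 4, so the weight grows exponentially with the length of the matching tail. The resulting value estimates show large spikes (for example 59.30 and 51.10, far above the largest baseline value of 10.00), even after many episodes.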

This behavior is expected and highlights the high variance problem of ordinary importance sampling in practice.
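
A small numeric illustration of this effect, assuming four actions, a uniform behavior policy, and a deterministic greedy target policy (the helper names are hypothetical):

```python
def tail_weight(n_steps: int, n_actions: int = 4) -> float:
    """Importance weight after n consecutive steps that match the greedy policy."""
    return float(n_actions ** n_steps)


def match_probability(n_steps: int, n_actions: int = 4) -> float:
    """Chance that a uniform random policy matches the greedy action n times in a row."""
    return (1.0 / n_actions) ** n_steps


for n in (1, 5, 10):
    print(n, tail_weight(n), match_probability(n))
```

A matching tail of 5 steps already carries a weight of 1024 while occurring with probability 1/1024. The product of weight and probability is always 1, which is why the average stays correct in expectation while individual updates can be enormous.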

**Weighted Importance Sampling (WIS)**

The second table shows the state-value function and greedy policy learned using Weighted Importance Sampling (WIS).

WIS improves stability by normalizing importance weights, which smooths learning and prevents extreme updates.
As a result, the learned values are stable and closely match the Value Iteration baseline from earlier.

This demonstrates why WIS is preferred over OIS for off-policy Monte Carlo control in finite-sample settings.

In [19]:
print("=== Off-policy MC (Ordinary IS) ===")
print(mdp.format_V(out["OIS"]["V"], decimals=2))
print("\nPolicy:")
print(mdp.format_pi(out["OIS"]["pi"]))

print("\n=== Off-policy MC (Weighted IS) ===")
print(mdp.format_V(out["WIS"]["V"], decimals=2))
print("\nPolicy:")
print(mdp.format_pi(out["WIS"]["pi"]))

=== Off-policy MC (Ordinary IS) ===
 -2.51   -3.44   -2.78   -3.88   -3.05*
 -3.66   16.77   -2.62   51.10   -2.54 
 -2.50   18.85   35.74*  48.59   10.30 
 -3.79*   0.94   13.97   17.15   10.58 
 -3.47   -3.12   21.08   59.30     G   

Policy:
 ←   ↑   →   ↓   ↓ 
 ↓   ↓   →   ↓   ↓ 
 ↑   ↑   ↓   ↓   ↓ 
 ↑   →   →   →   → 
 ←   →   ↓   ↓   G 

=== Off-policy MC (Weighted IS) ===
 -0.44    0.62    1.80    3.10    4.56*
  0.62    1.79    3.10    4.56    6.18 
  1.80    3.10    4.56*   6.18    7.99 
  3.11*   4.56    6.18    7.99   10.00 
  4.55    6.18    7.99   10.00     G   

Policy:
 →   ↓   →   ↓   ↓ 
 →   →   →   ↓   ↓ 
 →   ↓   →   ↓   ↓ 
 →   →   ↓   →   ↓ 
 →   →   →   →   G 


#### **Evidence: Comparing Weighted MC to Value Iteration**

In this cell, we quantitatively compare the final value function learned using Weighted Importance Sampling (WIS) with the optimal value function obtained from Value Iteration (VI).

Since Value Iteration computes the optimal solution directly using the known environment rules, it serves as a reference baseline. If the WIS method is working correctly, its learned values should be very close to the VI values, even though WIS learns only from sampled episodes.

To measure this closeness, we compute:
* *Maximum absolute difference:* $\max_s |V_{\text{WIS}}(s) - V_{\text{VI}}(s)|$. This shows the largest error at any state.
* *Mean absolute difference:* $\operatorname{mean}_s |V_{\text{WIS}}(s) - V_{\text{VI}}(s)|$. This shows the average error across all states.

**Interpretation of the result**
* A small maximum difference indicates that WIS has converged very close to the optimal solution.
* The low mean difference confirms that this agreement holds across the entire grid, not just a few states.
* This provides strong numerical evidence that Weighted Importance Sampling is stable and effective, even without knowing the full MDP model.

In contrast, Ordinary Importance Sampling (OIS) does not achieve this level of closeness due to its high variance, which is why WIS is preferred in practice.

In [20]:
V_wis = out["WIS"]["V"]
V_vi  = out["VI"]["V"]

max_diff = float(np.max(np.abs(V_wis - V_vi)))
mean_diff = float(np.mean(np.abs(V_wis - V_vi)))

print("=== Evidence (WIS vs VI) ===")
print(f"max |V_WIS - V_VI|  = {max_diff:.4f}")
print(f"mean|V_WIS - V_VI| = {mean_diff:.4f}")

=== Evidence (WIS vs VI) ===
max |V_WIS - V_VI|  = 0.0276
mean|V_WIS - V_VI| = 0.0144


#### **Comparison Table**

| Method                   | How it learns                                     | Needs full model? | Stability   | Speed     | Main takeaway                                     |
| ------------------------ | ------------------------------------------------- | ----------------- | ----------- | --------- | ------------------------------------------------- |
| **Value Iteration (VI)** | Uses equations and known rules of the environment | Yes               | Very stable | Very fast | Knows the map → walks straight to the best answer |
| **Off-policy MC (OIS)**  | Learns from random trial-and-error episodes       | No                | Unstable    | Slow      | Raw weighting causes big value spikes             |
| **Off-policy MC (WIS)**  | Learns from episodes but averages carefully       | No                | Stable      | Slow      | Smoothed learning → ends up close to VI           |

* Value Iteration (VI): Solves the problem by repeatedly applying equations using full knowledge of the environment.
* Ordinary Importance Sampling (OIS): Learns from random experiences but can become unstable due to large correction weights.
* Weighted Importance Sampling (WIS): Improves OIS by normalizing weights, resulting in smoother and more reliable learning.

#### **Final Comparison**

In this question, we solved the same 5×5 gridworld problem using off-policy Monte Carlo learning and compared it against a Value Iteration (VI) baseline. While all methods aim to estimate the optimal value function, they differ significantly in how they learn.

Value Iteration uses full knowledge of the environment, including the reward function and transition rules. Because of this, it converges very quickly and produces smooth, stable values across the grid. In our experiment, VI converged in 9 iterations and served as a reliable reference solution.

In contrast, Monte Carlo methods do not use the environment model. Instead, they learn purely from sampled episodes generated by a random behavior policy. This makes Monte Carlo more realistic for situations where the model is unknown, but it also introduces higher variability.

The Ordinary Importance Sampling (OIS) approach showed clear instability. Since it applies raw importance weights, some state values became extremely large. This behavior is expected and highlights a known weakness of OIS: variance can grow rapidly when the behavior policy differs from the target policy.

The Weighted Importance Sampling (WIS) method addressed this issue by normalizing importance weights. As a result, WIS produced a stable value function that closely matched the Value Iteration solution. The maximum difference between WIS and VI was approximately 0.0276, which is very small and provides strong evidence that WIS converges to the correct solution with enough samples.

Overall, this experiment demonstrates that while Monte Carlo methods are slower and noisier than Value Iteration, Weighted Importance Sampling provides a reliable and model-free alternative that can closely approximate optimal values when sufficient experience is available.