# Stochastic MDP Grid World Demo

This notebook demonstrates the implementation and usage of a **stochastic Markov Decision Process (MDP)** in a grid world environment with obstacles and a goal state.

We will:
- Load the MDP environment
- Visualize the optimal policy computed by value iteration
- Simulate a rollout starting from a user-defined position
- Visualize the policy and the path on the grid


In [None]:
# Import necessary modules
import numpy as np
import matplotlib.pyplot as plt
import stochastic_mdp_grid_world as mdp  # Make sure this file is in your working directory

## Grid World Setup

The grid world consists of a 5x6 grid with some obstacles and a goal state with a high reward.

- Obstacles are cells that cannot be traversed.
- The goal state gives a high reward.
- Each step incurs a penalty.

Let's look at the grid configuration:


In [None]:
print(f"Grid World Size: {mdp.grid_world}")
print(f"Obstacles at: {mdp.OBSTACLES}")
print(f"Goal State: {mdp.GOAL_STATE}")
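For readers without the module at hand, the three ingredients listed above (obstacles, a rewarding goal, a per-step penalty) can be sketched as follows. This is only an illustration: the obstacle positions, goal position, and reward values are assumptions, not the values stored in `stochastic_mdp_grid_world`, and the real module may also represent its state list differently.

```python
# Hypothetical layout and rewards; the real values live in the mdp module.
N_ROWS, N_COLS = 5, 6
OBSTACLES = [(1, 1), (2, 3), (3, 1)]  # assumed obstacle cells
GOAL_STATE = (0, 5)                   # assumed goal cell
GOAL_REWARD = 10.0                    # assumed reward for reaching the goal
STEP_PENALTY = -1.0                   # assumed cost of every other step

# Enumerate every traversable cell as a (row, col) tuple.
states = [
    (r, c)
    for r in range(N_ROWS)
    for c in range(N_COLS)
    if (r, c) not in OBSTACLES
]


def reward(next_state):
    """Reward received for entering next_state."""
    return GOAL_REWARD if next_state == GOAL_STATE else STEP_PENALTY


print(f"{len(states)} traversable states out of {N_ROWS * N_COLS} cells")
```

Excluding obstacles from the state list keeps the Q-value table free of unreachable rows; this is a design choice of the sketch, not a statement about the module.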

## Optimal Policy Computation

The code uses value iteration to compute the optimal policy for navigating the grid world under stochastic dynamics.

The policy is encoded in the Q-values matrix (`mdp.Q_values`): in each state, the best action is the one with the highest Q-value.
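To make the idea concrete, here is a minimal, self-contained sketch of value iteration on a small stochastic grid. It is not the module's implementation: the action names, the slip model (the intended move succeeds with probability 0.8 and the agent slips to either side with probability 0.1 each), the discount factor, and the reward values are all assumptions chosen for illustration.

```python
# Action names and the 0.8/0.1/0.1 slip model are assumptions for illustration.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}


def move(state, action, n_rows, n_cols, obstacles):
    """Apply one move; bumping into a wall or obstacle leaves the agent in place."""
    row = state[0] + ACTIONS[action][0]
    col = state[1] + ACTIONS[action][1]
    if 0 <= row < n_rows and 0 <= col < n_cols and (row, col) not in obstacles:
        return (row, col)
    return state


def value_iteration(n_rows, n_cols, obstacles, goal, gamma=0.95,
                    goal_reward=10.0, step_penalty=-1.0,
                    p_intended=0.8, tol=1e-6):
    """Return (Q, states), where Q maps (state, action) to its optimal value."""
    states = [(r, c) for r in range(n_rows) for c in range(n_cols)
              if (r, c) not in obstacles]
    p_slip = (1.0 - p_intended) / 2.0
    V = {s: 0.0 for s in states}
    while True:
        Q = {}
        for s in states:
            for a in ACTIONS:
                if s == goal:
                    Q[s, a] = 0.0  # the goal is terminal: no further reward
                    continue
                outcomes = [(p_intended, a)] + [(p_slip, b) for b in SIDEWAYS[a]]
                total = 0.0
                for p, actual in outcomes:
                    s2 = move(s, actual, n_rows, n_cols, obstacles)
                    r = goal_reward if s2 == goal else step_penalty
                    total += p * (r + gamma * V[s2])  # Bellman backup
                Q[s, a] = total
        new_V = {s: max(Q[s, a] for a in ACTIONS) for s in states}
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            return Q, states


Q, states = value_iteration(3, 3, obstacles={(1, 1)}, goal=(0, 2))
# Greedy policy: in each state, pick the action with the highest Q-value.
policy = {s: max(ACTIONS, key=lambda a: Q[s, a]) for s in states}
print(policy[(0, 1)])
```

The sketch stores Q-values in a dictionary keyed by `(state, action)` for readability; a matrix indexed by state and action indices, as `mdp.Q_values` appears to be, carries the same information.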

Let's visualize the optimal policy using arrows that indicate the best action to take at each state.


In [None]:
mdp.plot_arrows_MDP_STOC(mdp.Q_values, mdp.grid_world, mdp.actions, mdp.states, mdp.GOAL_STATE, OBSTACLES=mdp.OBSTACLES)
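As a rough idea of how an arrow view can be derived from a Q-value matrix, the sketch below picks the highest-valued action in each state and prints one symbol per cell. The function `render_policy`, the action names, and the ASCII symbols are hypothetical; `mdp.plot_arrows_MDP_STOC` may render its output differently.

```python
# Hypothetical action names mapped to ASCII arrows.
ARROWS = {"up": "^", "down": "v", "left": "<", "right": ">"}


def render_policy(Q, states, actions, n_rows, n_cols, goal, obstacles):
    """Return a text grid: G for the goal, # for obstacles, an arrow elsewhere."""
    lines = []
    for r in range(n_rows):
        cells = []
        for c in range(n_cols):
            if (r, c) == goal:
                cells.append("G")
            elif (r, c) in obstacles:
                cells.append("#")
            else:
                i = states.index((r, c))
                # Greedy action: column of the largest Q-value in row i.
                best = max(range(len(actions)), key=lambda k: Q[i][k])
                cells.append(ARROWS[actions[best]])
        lines.append(" ".join(cells))
    return "\n".join(lines)


# Tiny 2x2 example with hand-written Q-values (one row per state).
demo_actions = ["up", "down", "left", "right"]
demo_states = [(0, 0), (0, 1), (1, 1)]
demo_Q = [
    [0.0, 0.0, 0.0, 1.0],  # (0, 0): "right" is best
    [0.0, 0.0, 0.0, 0.0],  # (0, 1): goal
    [1.0, 0.0, 0.0, 0.0],  # (1, 1): "up" is best
]
grid_text = render_policy(demo_Q, demo_states, demo_actions, 2, 2,
                          goal=(0, 1), obstacles=[(1, 0)])
print(grid_text)
```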


## Simulate Policy Rollout

We can simulate a deterministic rollout of the learned policy starting from any valid start state.

Let's pick a start state and simulate the path taken to reach the goal state.


In [None]:
# Choose a start state (not an obstacle)
start_state = (4, 0)
assert start_state not in mdp.OBSTACLES, "Start state is inside an obstacle!"

start_index = mdp.states.index(start_state)
path = mdp.simulate_policy_MDP_STOC(mdp.Q_values, mdp.states, mdp.actions, start_index, mdp.GOAL_INDEX)

print(f"Simulated path from {start_state} to goal:")
print(path)
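The rollout idea can be sketched as repeatedly taking the highest-valued action and assuming the intended move always succeeds, which is what makes the rollout deterministic. The function `greedy_rollout`, the `MOVES` table, and the `max_steps` guard are assumptions for illustration; `mdp.simulate_policy_MDP_STOC` may work differently.

```python
# Hypothetical action names and their (row, col) offsets.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def greedy_rollout(Q, states, actions, start_index, goal_index, max_steps=100):
    """Follow the greedy action from the start state until the goal is reached."""
    i = start_index
    path = [states[i]]
    for _ in range(max_steps):  # guard against policies that never reach the goal
        if i == goal_index:
            break
        best = max(range(len(actions)), key=lambda k: Q[i][k])
        d_row, d_col = MOVES[actions[best]]
        nxt = (states[i][0] + d_row, states[i][1] + d_col)
        if nxt not in states:
            break  # the greedy move hits a wall or obstacle; stop the rollout
        i = states.index(nxt)
        path.append(nxt)
    return path


# Tiny 1x3 corridor with the goal at the right end.
demo_states = [(0, 0), (0, 1), (0, 2)]
demo_actions = ["left", "right"]
demo_Q = [[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
demo_path = greedy_rollout(demo_Q, demo_states, demo_actions,
                           start_index=0, goal_index=2)
print(demo_path)
```

Because the moves are not sampled, the path shows what the policy intends to do; an actual run of the stochastic environment could deviate from it.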


## Visualize Policy with Rollout Path

Let's visualize the optimal policy again, but this time overlay the rollout path from the start state to the goal.

The start state is marked with a flag, and each step in the path is labeled with its step number.


In [None]:
mdp.plot_arrows_MDP_STOC(mdp.Q_values, mdp.grid_world, mdp.actions, mdp.states, mdp.GOAL_STATE, OBSTACLES=mdp.OBSTACLES, path=path)


## Full Grid Visualization

Finally, let's visualize the grid world including obstacles, goal, policy arrows, and the rollout path using matplotlib.

This provides a clear and intuitive view of the environment and agent's policy.


In [None]:
mdp.plot_grid_world_MDP_STOC(mdp.states, mdp.grid_world, mdp.OBSTACLES, mdp.GOAL_STATE, path=path, Q_values=mdp.Q_values, actions=mdp.actions)


## Summary

- We set up a stochastic grid world MDP with obstacles and a goal.
- We used value iteration to compute the optimal Q-values and policy.
- We simulated a deterministic rollout of the policy from a given start state.
- We visualized the policy and the rollout path as both text-based arrows and a graphical grid plot.

This modular design makes it easy to experiment with different grid sizes, obstacles, rewards, and policies.
