# Markov Decision Processes (MDPs)

In this notebook, we'll explore **Markov Decision Processes (MDPs)**, the mathematical framework that forms the foundation of **Reinforcement Learning (RL)**.

An MDP provides a formal way to describe an environment in terms of **states, actions, rewards, and transitions**. An RL agent uses this description to choose a sequence of actions that maximizes cumulative reward.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
- Understand the **components** of an MDP.
- Define the **transition and reward functions**.
- Represent an **MDP as a mathematical model**.
- Compute **returns** and understand the **Markov property**.

## 🔹 What is a Markov Decision Process?

A **Markov Decision Process (MDP)** is defined by a tuple:

$$
MDP = (S, A, P, R, \gamma)
$$

where:
- **S**: Set of possible states.
- **A**: Set of possible actions.
- **P(s' | s, a)**: Transition probability, i.e. the probability of moving from state `s` to state `s'` after taking action `a`.
- **R(s, a)**: Reward received after performing action `a` in state `s`.
- **γ (gamma)**: Discount factor (0 ≤ γ ≤ 1), which determines how strongly future rewards are weighted relative to immediate ones.
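
As a rough sketch of how the tuple can be written down in code, the block below stores each component in plain Python containers. The two-state example, the helper `is_valid_mdp`, and all numbers are illustrative assumptions made here, not part of any library.

```python
# Hypothetical two-state MDP written as plain Python containers.
S = ["s0", "s1"]                 # states
A = ["stay", "go"]               # actions
gamma = 0.9                      # discount factor (assumed value)

# P[(s, a)] maps each next state s' to P(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.8, "s1": 0.2},
}

# R[(s, a)] is the reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "go"):  -1.0,
    ("s1", "stay"): 1.0,
    ("s1", "go"):  -1.0,
}

def is_valid_mdp(S, A, P, R, gamma, tol=1e-9):
    """Check that every (s, a) pair has a reward and a proper distribution."""
    if not 0.0 <= gamma <= 1.0:
        return False
    for s in S:
        for a in A:
            dist = P.get((s, a))
            if dist is None or (s, a) not in R:
                return False
            # Probabilities must be non-negative and point at known states.
            if any(p < 0 or s_next not in S for s_next, p in dist.items()):
                return False
            # Each conditional distribution must sum to 1.
            if abs(sum(dist.values()) - 1.0) > tol:
                return False
    return True

print(is_valid_mdp(S, A, P, R, gamma))
```

Storing `P` as a dictionary of dictionaries keeps the example readable; for larger problems a NumPy array indexed by state and action would be a common alternative.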

## 🔄 The Markov Property

The process is **Markovian** if the next state depends **only on the current state and action**, not on any previous states.

$$
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t)
$$

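In other words, the current state carries all the information needed to predict what happens next. This simplifies the modeling of environments and is a key assumption behind most RL algorithms.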
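
One way to see the Markov property in code is a step function whose only inputs are the current state and action. The sketch below assumes a made-up two-state weather chain; the names `P` and `step` and the probabilities are hypothetical.

```python
import random

# Hypothetical transition table: P[(s, a)] -> {s': probability}.
P = {
    ("sunny", "wait"): {"sunny": 0.8, "rainy": 0.2},
    ("rainy", "wait"): {"sunny": 0.4, "rainy": 0.6},
}

def step(state, action, rng):
    """Sample s' from P(. | state, action); no history is passed in or needed."""
    dist = P[(state, action)]
    next_states = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(next_states, weights=weights, k=1)[0]

rng = random.Random(0)           # seeded so the run is reproducible
state = "sunny"
history = [state]
for _ in range(5):
    state = step(state, "wait", rng)
    history.append(state)

print(history)
```

The `history` list is kept only for printing. `step` never reads it, which is exactly what the equation above expresses.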

## 🧮 Example: Simple Grid World

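Let's define a simple grid world where an agent can move **up, down, left, or right**. The goal is to reach the terminal state `G`. To keep the example short, the code below defines only the `right` transitions along the path S → A → B → G and includes no obstacles; any other move leaves the agent where it is.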

In [None]:
import numpy as np
import random

# Define states and actions
states = ['S', 'A', 'B', 'G']  # S=start, G=goal
actions = ['up', 'down', 'left', 'right']

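# Define deterministic transitions: (state, action) -> next state.
# Each listed move succeeds with probability 1; a full MDP would store P(s' | s, a).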
transition_prob = {
    ('S', 'right'): 'A',
    ('A', 'right'): 'B',
    ('B', 'right'): 'G'
}

# Reward received on entering each state
rewards = {
    'S': -1,
    'A': -1,
    'B': -1,
    'G': 10
}

# Simulate a simple trajectory
state = 'S'
total_reward = 0
trajectory = []

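# Follow the fixed policy "always move right" until the goal is reached
while state != 'G':
    action = 'right'
    # Undefined (state, action) pairs leave the agent in place
    next_state = transition_prob.get((state, action), state)
    reward = rewards[next_state]  # reward for entering next_state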
    trajectory.append((state, action, reward, next_state))
    total_reward += reward
    state = next_state

trajectory, total_reward

### Output Interpretation
- The agent moves from **S → A → B → G**.
- The total reward is the **sum of the step penalties** and the **final goal reward**: (-1) + (-1) + 10 = 8.
- This simple structure forms the basis of **policy evaluation** and **learning** in RL.

## 🧠 Return and Value Functions

The **return** ($G_t$) is the total discounted reward from time step `t`:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$
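
The same sum can be written recursively, which is convenient when computing returns backwards from the end of an episode. As a worked illustration, assume γ = 0.9 (a value chosen only for this example) and the grid-world reward sequence -1, -1, 10:

$$
G_t = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \dots \right) = R_{t+1} + \gamma G_{t+1}
$$

$$
G_0 = -1 + 0.9 \cdot (-1) + 0.9^2 \cdot 10 = -1 - 0.9 + 8.1 = 6.2
$$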

The **value function** gives the expected return from a state `s` under a policy `π`:

$$
V_\pi(s) = \mathbb{E}_\pi [G_t \mid S_t = s]
$$
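
Because the value function is an expectation, one simple way to approximate it is to average sampled returns. The sketch below is a Monte Carlo estimate on a made-up one-state process: every step pays a reward of 1 and the episode continues with probability `p_continue`. The process, the names, and the numbers are assumptions chosen so that the exact value is easy to derive by hand.

```python
import random

gamma = 0.9        # assumed discount factor
p_continue = 0.5   # assumed chance the episode continues after each step

def sample_return(rng):
    """Roll out one episode from the start state and return its discounted return."""
    G, discount = 0.0, 1.0
    while True:
        G += discount * 1.0          # reward of +1 on every step
        discount *= gamma
        if rng.random() >= p_continue:
            return G                 # episode terminates

rng = random.Random(42)
n_episodes = 20_000
estimate = sum(sample_return(rng) for _ in range(n_episodes)) / n_episodes

# Exact value from solving V = 1 + gamma * p_continue * V.
exact = 1.0 / (1.0 - gamma * p_continue)

print(f"Monte Carlo estimate: {estimate:.3f}, exact value: {exact:.3f}")
```

With more episodes the average should move closer to the exact value, which is the sense in which the value function is an "expected" return.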

The **action-value function** is:

$$
Q_\pi(s, a) = \mathbb{E}_\pi [G_t \mid S_t = s, A_t = a]
$$
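
To make the two definitions concrete, the following sketch evaluates the "always move right" policy on the grid-world corridor by repeatedly applying a one-step lookahead. It assumes γ = 0.9, rewards paid on entering a state, and that undefined moves leave the agent in place; the helper names `move` and `q_value` are hypothetical.

```python
# Deterministic corridor S -> A -> B -> G, mirroring the grid-world example.
next_state = {("S", "right"): "A", ("A", "right"): "B", ("B", "right"): "G"}
reward_on_entering = {"S": -1, "A": -1, "B": -1, "G": 10}
states = ["S", "A", "B", "G"]
gamma = 0.9                                            # assumed discount factor
policy = {"S": "right", "A": "right", "B": "right"}    # assumed fixed policy

def move(s, a):
    """Return (s', r); undefined moves leave the agent where it is."""
    s_next = next_state.get((s, a), s)
    return s_next, reward_on_entering[s_next]

def q_value(s, a, V):
    """One-step lookahead: Q(s, a) = r + gamma * V(s')."""
    s_next, r = move(s, a)
    return r + gamma * V[s_next]

# Iterative policy evaluation: repeat V(s) <- Q(s, pi(s)) until it stops changing.
V = {s: 0.0 for s in states}       # V(G) stays 0 because G is terminal
for _ in range(100):
    delta = 0.0
    for s in policy:
        new_v = q_value(s, policy[s], V)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-12:
        break

print({s: round(v, 2) for s, v in V.items()})
print(round(q_value("S", "left", V), 2))
```

Under these assumptions the values come out as V(S) = 6.2, V(A) = 8 and V(B) = 10, while taking `left` once in `S` and then following the policy gives Q(S, left) = 4.58, lower than V(S) because the wasted step costs a penalty and delays the goal reward.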

In [None]:
# Example: Compute discounted return
rewards_list = [1, 1, 1, 10]  # rewards over steps
gamma = 0.9

G = 0
for r in reversed(rewards_list):
    G = r + gamma * G

print(f"Total discounted return: {G:.2f}")

## 📘 Summary

- MDPs provide the **theoretical foundation** for RL problems.
- The **Markov property** states that the next state depends only on the current state and action, so decisions can be based on the current state alone.
- Rewards and transition functions define the **environment's behavior**.
- **Value and Q-functions** give the expected long-term return of following a policy from a state or state-action pair.

Next, we’ll explore how **policies** and **value iteration** help find optimal strategies for decision-making.