In [1]:
%%html
<style>
table {align:left;display:block}
</style>

# Markov Decision Process (MDP)
----

**Value Iteration Process with Policy Changes in MDP**

We begin with a Markov Decision Process (MDP) where an agent decides whether to invest conservatively (C) or aggressively (A) in a financial portfolio. The objective is to find an optimal policy maximizing long-term rewards.

---

### **Step 1: Defining the MDP Components**

**States (S):**

- Low Wealth (L)
- Medium Wealth (M)
- High Wealth (H)

**Actions (A):**

- Conservative (C)
- Aggressive (A)

**Transition Probabilities:**

>| Current State | Action | Next State Probabilities     |
| ------------- | ------ | ---------------------------- |
| Low (L)       | C      | 80% Stay in L, 20% Move to M |
| Low (L)       | A      | 60% Stay in L, 40% Move to M |
| Medium (M)    | C      | 70% Stay in M, 30% Move to H |
| Medium (M)    | A      | 50% Stay in M, 50% Move to H |
| High (H)      | C      | 90% Stay in H, 10% Drop to M |
| High (H)      | A      | 70% Stay in H, 30% Drop to M |

**Rewards:**

- Low Wealth (L): -1
- Medium Wealth (M): 3
- High Wealth (H): 5

**Discount Factor (γ):** 0.9
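The components above can be written down as plain Python data. This is a minimal sketch, not part of the original notebook: the names `STATES`, `ACTIONS`, `GAMMA`, `REWARD`, `TRANSITIONS`, and `validate` are illustrative assumptions, and the only check performed is that each row of transition probabilities sums to 1.

```python
import math

# Labels, rewards, and discount factor copied from the definitions above.
STATES = ["L", "M", "H"]
ACTIONS = ["C", "A"]
GAMMA = 0.9
REWARD = {"L": -1.0, "M": 3.0, "H": 5.0}

# TRANSITIONS[(state, action)] maps each reachable next state to its probability.
TRANSITIONS = {
    ("L", "C"): {"L": 0.8, "M": 0.2},
    ("L", "A"): {"L": 0.6, "M": 0.4},
    ("M", "C"): {"M": 0.7, "H": 0.3},
    ("M", "A"): {"M": 0.5, "H": 0.5},
    ("H", "C"): {"H": 0.9, "M": 0.1},
    ("H", "A"): {"H": 0.7, "M": 0.3},
}


def validate(transitions):
    """Raise ValueError if any (state, action) row does not sum to 1."""
    for s in STATES:
        for a in ACTIONS:
            total = sum(transitions[(s, a)].values())
            if not math.isclose(total, 1.0):
                raise ValueError(f"row ({s}, {a}) sums to {total}")


validate(TRANSITIONS)
```

Storing only the reachable next states keeps the table close to the one above; missing entries are treated as probability 0.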

---

### **Step 2: Value Iteration Updates**

We initialize values: $V_0(L) = 0$, $V_0(M) = 0$, $V_0(H) = 0$.

#### **Iteration 1**

Using Bellman’s equation:

>$
V_1(s) = \max_a \left[ R(s) + \gamma \sum_{s'} P(s' | s, a) V_0(s') \right]
$

For **Low Wealth (L):**

>$
V_1(L) = \max \left[ -1 + 0.9(0.8V_0(L) + 0.2V_0(M)), -1 + 0.9(0.6V_0(L) + 0.4V_0(M)) \right]
$

For **Medium Wealth (M):**

>$
V_1(M) = \max \left[ 3 + 0.9(0.7V_0(M) + 0.3V_0(H)), 3 + 0.9(0.5V_0(M) + 0.5V_0(H)) \right]
$

For **High Wealth (H):**

>$
V_1(H) = \max \left[ 5 + 0.9(0.9V_0(H) + 0.1V_0(M)), 5 + 0.9(0.7V_0(H) + 0.3V_0(M)) \right]
$

Since $V_0(L) = V_0(M) = V_0(H) = 0$, the discounted term vanishes for both actions, so the first-iteration values are just the immediate rewards.

>$
V_1(L) = -1, \quad V_1(M) = 3, \quad V_1(H) = 5
$
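A single Bellman backup can be sketched in a few lines of Python. The helper names `q_value` and `bellman_backup` are assumptions made for this illustration; the numbers come from the tables in Step 1.

```python
STATES = ["L", "M", "H"]
ACTIONS = ["C", "A"]
GAMMA = 0.9
REWARD = {"L": -1.0, "M": 3.0, "H": 5.0}
TRANSITIONS = {
    ("L", "C"): {"L": 0.8, "M": 0.2},
    ("L", "A"): {"L": 0.6, "M": 0.4},
    ("M", "C"): {"M": 0.7, "H": 0.3},
    ("M", "A"): {"M": 0.5, "H": 0.5},
    ("H", "C"): {"H": 0.9, "M": 0.1},
    ("H", "A"): {"H": 0.7, "M": 0.3},
}


def q_value(state, action, values):
    """R(s) + gamma * sum over s' of P(s' | s, a) * V(s')."""
    expected = sum(p * values[nxt] for nxt, p in TRANSITIONS[(state, action)].items())
    return REWARD[state] + GAMMA * expected


def bellman_backup(values):
    """One value-iteration sweep: take the best action value in every state."""
    return {s: max(q_value(s, a, values) for a in ACTIONS) for s in STATES}


V0 = {s: 0.0 for s in STATES}
V1 = bellman_backup(V0)
print(V1)  # {'L': -1.0, 'M': 3.0, 'H': 5.0}
```

Because every entry of `V0` is zero, both actions give the same value in each state, so the maximum is simply the reward.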

#### **Policy Evaluation after Iteration 1**

>$
Q(L, C) = -1 + 0.9(0.8(-1) + 0.2(3)) = -1.18
$

>$
Q(L, A) = -1 + 0.9(0.6(-1) + 0.4(3)) = -0.46
$

>$
Q(M, C) = 3 + 0.9(0.7(3) + 0.3(5)) = 6.24
$

>$
Q(M, A) = 3 + 0.9(0.5(3) + 0.5(5)) = 6.6
$

>$
Q(H, C) = 5 + 0.9(0.9(5) + 0.1(3)) = 9.32
$

>$
Q(H, A) = 5 + 0.9(0.7(5) + 0.3(3)) = 8.96
$

**Policy at Iteration 1:**
- L → Aggressive (A), since $Q(L, A) = -0.46 > Q(L, C) = -1.18$
- M → Aggressive (A)
- H → Conservative (C)


#### **Iteration 2**

Updating $V_2(s)$:

>$
V_2(L) = \max \left[ -1 + 0.9(0.8(-1) + 0.2(3)), -1 + 0.9(0.6(-1) + 0.4(3)) \right]
$

>$
V_2(M) = \max \left[ 3 + 0.9(0.7(3) + 0.3(5)), 3 + 0.9(0.5(3) + 0.5(5)) \right]
$

>$
V_2(H) = \max \left[ 5 + 0.9(0.9(5) + 0.1(3)), 5 + 0.9(0.7(5) + 0.3(3)) \right]
$

Computing these:

>$
V_2(L) = -0.46, \quad V_2(M) = 6.6, \quad V_2(H) = 9.32
$

#### **Policy Evaluation after Iteration 2**


##### From state L

>$
Q(L, C) = -1 + 0.9(0.8(-0.46) + 0.2(6.6)) = -0.1432
$

>$
Q(L, A) = -1 + 0.9(0.6(-0.46) + 0.4(6.6)) = 1.1276
$

##### From state M

>$
Q(M, C) = 3 + 0.9(0.7(6.6) + 0.3(9.32)) = 9.6744
$

>$
Q(M, A) = 3 + 0.9(0.5(6.6) + 0.5(9.32)) = 10.164
$

##### From state H

>$
Q(H, C) = 5 + 0.9(0.9(9.32) + 0.1(6.6)) = 13.1432
$

>$
Q(H, A) = 5 + 0.9(0.7(9.32) + 0.3(6.6)) = 12.6536
$

**Policy at Iteration 2:**
- L → Aggressive (A)
- M → Aggressive (A)
- H → Conservative (C)
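The action values and the resulting greedy policy can be reproduced with a short script. This is a sketch under the assumption that the Iteration 2 values $V_2 = (-0.46, 6.6, 9.32)$ are taken as given; `q_table` and `greedy_policy` are hypothetical helper names.

```python
STATES = ["L", "M", "H"]
ACTIONS = ["C", "A"]
GAMMA = 0.9
REWARD = {"L": -1.0, "M": 3.0, "H": 5.0}
TRANSITIONS = {
    ("L", "C"): {"L": 0.8, "M": 0.2},
    ("L", "A"): {"L": 0.6, "M": 0.4},
    ("M", "C"): {"M": 0.7, "H": 0.3},
    ("M", "A"): {"M": 0.5, "H": 0.5},
    ("H", "C"): {"H": 0.9, "M": 0.1},
    ("H", "A"): {"H": 0.7, "M": 0.3},
}


def q_table(values):
    """Action value Q(s, a) for every state-action pair under the given V."""
    return {
        (s, a): REWARD[s]
        + GAMMA * sum(p * values[nxt] for nxt, p in TRANSITIONS[(s, a)].items())
        for s in STATES
        for a in ACTIONS
    }


def greedy_policy(q):
    """Pick the action with the largest Q in each state (ties go to 'C')."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}


V2 = {"L": -0.46, "M": 6.6, "H": 9.32}
Q2 = q_table(V2)
policy2 = greedy_policy(Q2)

for (s, a), q in Q2.items():
    print(f"Q({s}, {a}) = {q:.4f}")
print(policy2)
```

The tie-breaking rule is a design choice of this sketch; it does not matter here because no two action values are equal.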

#### **Iteration 3**

Updating $V_3(s)$:

>$
V_3(L) = \max \left[ -1 + 0.9(0.8(-0.46) + 0.2(6.6)), -1 + 0.9(0.6(-0.46) + 0.4(6.6)) \right]
$

>$
V_3(M) = \max \left[ 3 + 0.9(0.7(6.6) + 0.3(9.32)), 3 + 0.9(0.5(6.6) + 0.5(9.32)) \right]
$

>$
V_3(H) = \max \left[ 5 + 0.9(0.9(9.32) + 0.1(6.6)), 5 + 0.9(0.7(9.32) + 0.3(6.6)) \right]
$

Computing these:

>$
V_3(L) = 1.1276, \quad V_3(M) = 10.164, \quad V_3(H) = 13.1432
$
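The three hand-computed sweeps can be checked by applying the same backup repeatedly. The sketch below assumes the MDP from Step 1; `backup` and `history` are illustrative names.

```python
STATES = ["L", "M", "H"]
ACTIONS = ["C", "A"]
GAMMA = 0.9
REWARD = {"L": -1.0, "M": 3.0, "H": 5.0}
TRANSITIONS = {
    ("L", "C"): {"L": 0.8, "M": 0.2},
    ("L", "A"): {"L": 0.6, "M": 0.4},
    ("M", "C"): {"M": 0.7, "H": 0.3},
    ("M", "A"): {"M": 0.5, "H": 0.5},
    ("H", "C"): {"H": 0.9, "M": 0.1},
    ("H", "A"): {"H": 0.7, "M": 0.3},
}


def backup(values):
    """V_{k+1}(s) = max_a [ R(s) + gamma * sum_s' P(s'|s,a) V_k(s') ]."""
    return {
        s: max(
            REWARD[s]
            + GAMMA * sum(p * values[nxt] for nxt, p in TRANSITIONS[(s, a)].items())
            for a in ACTIONS
        )
        for s in STATES
    }


# history[k] holds V_k, starting from the all-zero V_0.
history = [{s: 0.0 for s in STATES}]
for _ in range(3):
    history.append(backup(history[-1]))

for k, values in enumerate(history):
    print(k, {s: round(v, 4) for s, v in values.items()})
```

Each sweep reads only the previous iterate, which matches the synchronous updates used in the hand calculation.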

#### **Policy Change Analysis**

Using the **Iteration 3** values, we recompute the action values to check whether the greedy policy differs from the one found at **Iteration 2**.

For **Low Wealth (L):**

>$
Q(L, C) = -1 + 0.9(0.8(1.1276) + 0.2(10.164)) = 1.641392
$

>$
Q(L, A) = -1 + 0.9(0.6(1.1276) + 0.4(10.164)) = 3.267944
$

For **Medium Wealth (M):**

>$
Q(M, C) = 3 + 0.9(0.7(10.164) + 0.3(13.1432)) = 12.951984
$

>$
Q(M, A) = 3 + 0.9(0.5(10.164) + 0.5(13.1432)) = 13.48824
$

For **High Wealth (H):**

>$
Q(H, C) = 5 + 0.9(0.9(13.1432) + 0.1(10.164)) = 16.560752
$

>$
Q(H, A) = 5 + 0.9(0.7(13.1432) + 0.3(10.164)) = 16.024496
$

Since $Q(L, A) > Q(L, C)$, $Q(M, A) > Q(M, C)$, and $Q(H, C) > Q(H, A)$, the greedy policy is the same as at Iteration 2:

- **Low Wealth (L)** → Aggressive (A)
- **Medium Wealth (M)** → Aggressive (A)
- **High Wealth (H)** → Conservative (C)
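One way to check for policy changes programmatically is to record the greedy policy after each backup and compare consecutive entries. The sketch below does this for the first three value estimates; `q`, `backup`, `greedy`, `policies`, and `changed` are names assumed for this illustration.

```python
STATES = ["L", "M", "H"]
ACTIONS = ["C", "A"]
GAMMA = 0.9
REWARD = {"L": -1.0, "M": 3.0, "H": 5.0}
TRANSITIONS = {
    ("L", "C"): {"L": 0.8, "M": 0.2},
    ("L", "A"): {"L": 0.6, "M": 0.4},
    ("M", "C"): {"M": 0.7, "H": 0.3},
    ("M", "A"): {"M": 0.5, "H": 0.5},
    ("H", "C"): {"H": 0.9, "M": 0.1},
    ("H", "A"): {"H": 0.7, "M": 0.3},
}


def q(s, a, values):
    """Action value of taking a in s and then following the estimate V."""
    return REWARD[s] + GAMMA * sum(
        p * values[nxt] for nxt, p in TRANSITIONS[(s, a)].items()
    )


def backup(values):
    """One synchronous value-iteration sweep."""
    return {s: max(q(s, a, values) for a in ACTIONS) for s in STATES}


def greedy(values):
    """Best action per state with respect to the estimate V."""
    return {s: max(ACTIONS, key=lambda a: q(s, a, values)) for s in STATES}


values = {s: 0.0 for s in STATES}
policies = []  # policies[i] is greedy with respect to V_{i+1}
for _ in range(3):
    values = backup(values)
    policies.append(greedy(values))

# changed[i] is True when the policy differs between consecutive iterations.
changed = [policies[i] != policies[i - 1] for i in range(1, len(policies))]
for i, policy in enumerate(policies, start=1):
    print(f"greedy policy w.r.t. V_{i}: {policy}")
print("changed:", changed)
```

The all-zero starting estimate is skipped on purpose: every action ties there, so any policy read off it would reflect only the tie-breaking rule.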



### Summary: Value Evolution Over Iterations

>| State  | Iteration 1 | Iteration 2 | Iteration 3 |
|--------|-------------|------------|------------|
| Low    | -1          | -0.46      | 1.1276     |
| Medium | 3           | 6.6        | 10.164     |
| High   | 5           | 9.32       | 13.1432    |

This table shows how the value estimate for each state changes over iterations. The values estimate discounted long-term reward, not wealth itself. For an agent in the Low state that always takes the best action, the estimate improves from -1 to 1.1276 by the third iteration, and one further backup (the largest action value computed above) raises it to about 3.27. The Medium and High estimates grow in the same way, reaching 13.48824 and 16.560752 after that further backup. The estimates are still increasing because value iteration has not yet converged, while the greedy policy (L → A, M → A, H → C) is the same at Iterations 2 and 3.
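To see where the estimates end up, the same backup can be repeated until successive iterates stop changing. This is a sketch, not the notebook's own code: the stopping tolerance `TOL`, the cap `MAX_SWEEPS`, and the function name `value_iteration` are assumptions chosen for illustration.

```python
STATES = ["L", "M", "H"]
ACTIONS = ["C", "A"]
GAMMA = 0.9
TOL = 1e-10          # assumed stopping threshold on the largest change
MAX_SWEEPS = 10_000  # assumed safety cap on the number of sweeps
REWARD = {"L": -1.0, "M": 3.0, "H": 5.0}
TRANSITIONS = {
    ("L", "C"): {"L": 0.8, "M": 0.2},
    ("L", "A"): {"L": 0.6, "M": 0.4},
    ("M", "C"): {"M": 0.7, "H": 0.3},
    ("M", "A"): {"M": 0.5, "H": 0.5},
    ("H", "C"): {"H": 0.9, "M": 0.1},
    ("H", "A"): {"H": 0.7, "M": 0.3},
}


def q(s, a, values):
    """Action value under the current estimate."""
    return REWARD[s] + GAMMA * sum(
        p * values[nxt] for nxt, p in TRANSITIONS[(s, a)].items()
    )


def value_iteration():
    """Sweep until the largest per-state change falls below TOL."""
    values = {s: 0.0 for s in STATES}
    for sweep in range(1, MAX_SWEEPS + 1):
        new_values = {s: max(q(s, a, values) for a in ACTIONS) for s in STATES}
        delta = max(abs(new_values[s] - values[s]) for s in STATES)
        values = new_values
        if delta < TOL:
            break
    policy = {s: max(ACTIONS, key=lambda a: q(s, a, values)) for s in STATES}
    return values, policy, sweep


V_star, pi_star, sweeps = value_iteration()
print({s: round(v, 4) for s, v in V_star.items()}, pi_star, sweeps)
```

With $\gamma = 0.9$ the backup is a contraction, so the loop terminates. Under these assumptions the estimates settle near $V(L) \approx 32.31$, $V(M) \approx 44.06$, $V(H) \approx 47.19$, with the greedy policy L → A, M → A, H → C.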
