---

## **Introduction**

### **Importance Sampling**

Importance sampling is a method for estimating an expected value under one distribution \( A \) using samples drawn from a different distribution \( B \).

In this approach, a **weighting factor**, known as the **importance sampling ratio**, is applied to the sampled values to adjust for the difference between the two distributions. The variance of this method depends on how similar the two distributions are.
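The idea can be written out as a short derivation. This is a sketch for the discrete case; the function \( f \) (the quantity whose expectation is wanted) and the sample count \( N \) are notation introduced here for illustration, and the identity assumes \( P_B(x) > 0 \) wherever \( P_A(x) f(x) \neq 0 \).

$$
\mathbb{E}_{x \sim A}[f(x)] = \sum_x P_A(x)\, f(x) = \sum_x P_B(x)\, \frac{P_A(x)}{P_B(x)}\, f(x) = \mathbb{E}_{x \sim B}\!\left[\frac{P_A(x)}{P_B(x)}\, f(x)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{P_A(x_i)}{P_B(x_i)}\, f(x_i), \qquad x_i \sim B
$$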

---

#### Example

Suppose the distribution \( A \) selects the action **"UP"** with probability 1, while the distribution \( B \) selects **"UP"** with probability only 0.01. In this case, the importance sampling ratio for **"UP"** becomes:

$$
\text{Ratio} = \frac{P_A(\text{UP})}{P_B(\text{UP})} = \frac{1}{0.01} = 100
$$

Thus, the sampled value is scaled by 100, leading to a significant increase in its weight.
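To see why such a large ratio is a problem, it may help to look at how the ratio behaves across everything \( B \) can sample. The sketch below is illustrative only: it lumps every non-UP action into a single hypothetical `"OTHER"` action, which \( A \) never selects and \( B \) selects 99% of the time.

```python
import numpy as np

# Action probabilities over ("UP", "OTHER"). "OTHER" is a hypothetical
# placeholder that lumps together every non-UP action.
p_a = np.array([1.0, 0.0])    # distribution A: always UP
p_b = np.array([0.01, 0.99])  # distribution B: UP only 1% of the time

# Importance sampling ratio for each action that B can produce.
ratio = p_a / p_b

# Mean and variance of the ratio when actions are drawn from B.
mean_ratio = np.sum(p_b * ratio)
var_ratio = np.sum(p_b * ratio**2) - mean_ratio**2

print("ratio per action:", ratio)
print("mean of ratio under B:", mean_ratio)
print("variance of ratio under B:", var_ratio)
```

Under these assumptions, 99% of the samples receive weight 0 and the remaining 1% receive weight 100. The weights still average to 1, but their variance is about 99, which is what makes the estimate unstable.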

---

### Key Insight

When the two distributions \( A \) and \( B \) are similar, the probabilities they assign to each action are close, so the importance sampling ratios stay **close to 1**. Ratios close to 1 keep the variance of the estimate low, making it more stable and efficient. Conversely, a large discrepancy between the distributions produces ratios far from 1 (very large for some actions, near 0 for others), which leads to higher variance and can degrade the quality of the estimation.

---

💡 **Takeaway**  
The closer the probability distributions \( A \) and \( B \) are, the better the performance of importance sampling due to reduced variance in the weighted estimates.

---

<br>

As the distributions diverge, the variance increases, leading to greater errors in the estimation of the expected value. This phenomenon can be observed from the following results:

In [54]:
import numpy as np

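def simulate_dice(target_policy, behavior_policy, num_epochs):
  # Roll the die num_epochs times under the behavior policy and weight each
  # face by the importance sampling ratio. Uses the global dice_faces array.
  weighted_face = []
  for episode in range(num_epochs):
    face = np.random.choice(dice_faces, p=behavior_policy)
    idx = np.where(dice_faces == face)[0][0]  # scalar index of the sampled face
    rho = target_policy[idx] / behavior_policy[idx]  # importance sampling ratio
    weighted_face.append(face * rho)
  mean = np.mean(weighted_face)
  var = np.var(weighted_face)
  return mean, var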

In [57]:
import numpy as np
# Define dice faces (1 to 6)
dice_faces = np.array([1, 2, 3, 4, 5, 6])

# Define the behavior policy: biased dice with 70% probability for face 6
behavior_policy = np.array([0.3/5, 0.3/5, 0.3/5, 0.3/5, 0.3/5, 0.7])

# Define the target policy: biased dice with 50% probability for face 1
target_policy = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])

# Number of episodes to simulate
total_episode = 1000

mean, var = simulate_dice(target_policy, behavior_policy, total_episode)
true_expected_value = np.sum(dice_faces * target_policy)

print(f"🎲 Estimated Expected Value: {mean:.4f}")
print(f"🎲 True Expected Value: {true_expected_value}")
print(f"📊 Variance: {var:.4f}")
print(f"📊 Error: {np.abs((mean-true_expected_value)/true_expected_value) * 100:.4f} %")


🎲 Estimated Expected Value: 2.6046
🎲 True Expected Value: 2.5
📊 Variance: 7.7602
📊 Error: 4.1848 %


In [58]:
dice_faces = np.array([1, 2, 3, 4, 5, 6])
behavior_policy = np.array([0.6, 0.4/5, 0.4/5, 0.4/5, 0.4/5, 0.4/5])
target_policy = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])

total_episode = 1000

mean, var = simulate_dice(target_policy, behavior_policy, total_episode)
true_expected_value = np.sum(dice_faces * target_policy)

print(f"🎲 Estimated Expected Value: {mean:.4f}")
print(f"🎲 True Expected Value: {true_expected_value}")
print(f"📊 Variance: {var:.4f}")
print(f"📊 Error: {np.abs((mean-true_expected_value)/true_expected_value) * 100:.4f} %")


🎲 Estimated Expected Value: 2.4471
🎲 True Expected Value: 2.5
📊 Variance: 5.1474
📊 Error: 2.1167 %
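As a cross-check on the two runs above, the mean and variance of a single weighted roll can be computed exactly instead of being sampled. The sketch below is an illustration rather than part of the original notebook: `exact_is_moments` is a hypothetical helper, and it assumes the behavior policy is non-zero wherever the target policy is non-zero. It uses the identity Var = E_B[(rho * face)^2] - (E_B[rho * face])^2.

```python
import numpy as np

def exact_is_moments(faces, target, behavior):
    """Exact mean and variance of rho * face for one roll drawn from `behavior`.

    Hypothetical helper; assumes behavior > 0 wherever target > 0.
    """
    faces = np.asarray(faces, dtype=float)
    target = np.asarray(target, dtype=float)
    behavior = np.asarray(behavior, dtype=float)
    support = behavior > 0  # faces the behavior policy can actually roll
    weighted = (target[support] / behavior[support]) * faces[support]
    mean = np.sum(behavior[support] * weighted)
    var = np.sum(behavior[support] * weighted**2) - mean**2
    return mean, var

faces = np.array([1, 2, 3, 4, 5, 6])
target = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])

# First run: behavior policy concentrated on face 6 (far from the target).
behavior_far = np.array([0.3/5, 0.3/5, 0.3/5, 0.3/5, 0.3/5, 0.7])
# Second run: behavior policy concentrated on face 1 (close to the target).
behavior_near = np.array([0.6, 0.4/5, 0.4/5, 0.4/5, 0.4/5, 0.4/5])

mean_far, var_far = exact_is_moments(faces, target, behavior_far)
mean_near, var_near = exact_is_moments(faces, target, behavior_near)

print(f"far:  mean={mean_far:.4f}, variance={var_far:.4f}")
print(f"near: mean={mean_near:.4f}, variance={var_near:.4f}")
```

Under these assumptions both exact means equal the true value 2.5, and the exact per-roll variances are about 7.43 for the first behavior policy and about 5.42 for the second, in line with the sampled values of 7.7602 and 5.1474.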


In [59]:
dice_faces = np.array([1, 2, 3, 4, 5, 6])
behavior_policy = np.array([0, 0.2, 0.2, 0.2, 0.2, 0.2])
target_policy = np.array([1, 0, 0, 0, 0, 0])

total_episode = 100

mean, var = simulate_dice(target_policy, behavior_policy, total_episode)
# The target policy always rolls face 1, so its true expected value is 1
true_expected_value = np.sum(dice_faces * target_policy)

print(f"🎲 Estimated Expected Value: {mean:.4f}")
print(f"🎲 True Expected Value: {true_expected_value}")
print(f"📊 Variance: {var:.4f}")
print(f"📊 Error: {np.abs((mean-true_expected_value)/true_expected_value) * 100:.4f} %")


🎲 Estimated Expected Value: 0.0000
🎲 True Expected Value: 1
📊 Variance: 0.0000
📊 Error: 100.0000 %


---

## **Conclusion**

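In the runs above, the behavior policy that is closer to the target policy gives a **lower variance** (5.1474 vs. 7.7602) and a smaller **error in the estimated expected value** (2.1167 % vs. 4.1848 %). Each result comes from a single run of 1000 episodes, so the exact numbers will vary between runs, but the trend matches the expectation that more similar distributions yield more stable estimates.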

However, in the last case the behavior policy never generates the only action the target policy can take (face 1). Every face that is sampled therefore has probability 0 under the target policy, so the **importance sampling ratio** is always 0, and the estimated expected value is consequently 0.

To prevent this, the behavior policy must assign a non-zero probability to any action that has a non-zero probability in the target policy. This requirement is called the **coverage assumption**.
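A coverage check can be run before sampling. The following is a minimal sketch, assuming both policies are given as probability arrays over the same actions; `satisfies_coverage` is a hypothetical helper name, not part of NumPy or of the notebook above.

```python
import numpy as np

def satisfies_coverage(target_policy, behavior_policy):
    """Return True if behavior_policy > 0 wherever target_policy > 0."""
    target_policy = np.asarray(target_policy, dtype=float)
    behavior_policy = np.asarray(behavior_policy, dtype=float)
    # Only actions the target policy can take need to be covered.
    needed = target_policy > 0
    return bool(np.all(behavior_policy[needed] > 0))

# The three (target, behavior) pairs used in the experiments above.
ok_first = satisfies_coverage(
    [0.5, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.3/5, 0.3/5, 0.3/5, 0.3/5, 0.3/5, 0.7],
)
ok_second = satisfies_coverage(
    [0.5, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.6, 0.4/5, 0.4/5, 0.4/5, 0.4/5, 0.4/5],
)
ok_third = satisfies_coverage(
    [1, 0, 0, 0, 0, 0],
    [0, 0.2, 0.2, 0.2, 0.2, 0.2],
)

print(ok_first, ok_second, ok_third)
```

Only the third pair fails the check, because the behavior policy assigns probability 0 to face 1 while the target policy assigns it probability 1.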


<br>

---