Importance sampling is a simulation technique for estimating probabilities and expectations, and it is particularly useful for rare events. Suppose that $X$ is a random variable with density $\pi$, and suppose we wish to use simulation to estimate the probability that $X$ takes a value in some set $A$. If $X$ is difficult to simulate, or if the event that $X$ is in $A$ is very rare, then this might be hard to do.

Suppose that we have some other random variable $Y$ with density $\pi'$, such that $\pi'(x) > 0$ whenever $\pi(x) > 0$. Let $L$ be the likelihood ratio,
\begin{equation}
    L(y) = \frac{\pi(y)}{\pi'(y)}.
\end{equation}
Let $\gamma = \Pr(X \in A)$, and consider the estimator
\begin{equation}
    \hat{\gamma} = L(Y)\mathbb{1}[Y \in A]
\end{equation}
where $\mathbb{1}[\cdot]$ is the indicator function. We call $\pi'$ the twisted density.

We claim that $\hat{\gamma}$ is an unbiased estimator of $\gamma$.


To find the expected value of $\hat{\gamma}$, we integrate the estimator with respect to the probability density function of $Y$,
\begin{equation}
    E[\hat{\gamma}] = E[L(Y)\mathbb{1}[Y \in A]] = \int_A L(y)\pi'(y) \,dy.
\end{equation}
Now, substitute the definition of the likelihood ratio $L(y)$,
\begin{equation}
    E[\hat{\gamma}] = \int_A \frac{\pi(y)}{\pi'(y)}\pi'(y) \,dy = \int_A \pi(y) \,dy.
\end{equation}
By definition, the integral of the probability density function $\pi$ over the set $A$ is the probability that the random variable $X$ takes a value in $A$, which is precisely equal to $\gamma$. Thus, we have shown that $E[\hat{\gamma}] = \gamma$.

Because $\hat{\gamma}$ is an unbiased estimator of $\gamma$, we can use a Monte Carlo simulation method to estimate $\gamma$. The law of large numbers states that the average of a large number of i.i.d. samples of a random variable converges to its expected value. The procedure is as follows:

*   Draw a large number $n$ of independent samples $Y_1, \dots, Y_n$ from the importance distribution with density $\pi'(y)$.
*   For each sample $Y_i$, calculate a value for the estimator $\hat{\gamma}_i$:
    *   Compute the likelihood ratio $L(Y_i) = \pi(Y_i)/\pi'(Y_i)$;
    *   Check if the sample falls within the set $A$;
    *   Calculate the estimate $\hat{\gamma}_i = L(Y_i) \mathbb{1}[Y_i \in A]$. This will be $L(Y_i)$ if $Y_i \in A$ and $0$ otherwise.
*   The estimate of $\gamma$ is the sample mean of all the individual estimates $\hat{\gamma}_i$:
\begin{equation}
    \hat{\gamma} = \frac{1}{n} \sum_{i = 1}^n \hat{\gamma}_i = \frac{1}{n} \sum_{i=1}^n L(Y_i) \mathbb{1}[Y_i \in A].
\end{equation}
As the number of samples $n$ increases, this estimated value will converge to the true value of $\gamma$.

This method is called importance sampling.
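
The steps above can be sketched as a small generic routine. This is a minimal illustration rather than a definitive implementation: the function name `importance_sampling_estimate`, its arguments, and the test problem (a standard normal target, a proposal shifted to mean $4$, and $A = \{x > 4\}$) are all assumptions chosen for the example and do not come from the text.

```python
import math
import numpy as np

def importance_sampling_estimate(sample_proposal, target_pdf, proposal_pdf, in_A, n, rng):
    """Return the importance sampling estimate of Pr(X in A) and its standard error."""
    y = sample_proposal(rng, n)                  # Y_1, ..., Y_n drawn from pi'
    weights = target_pdf(y) / proposal_pdf(y)    # likelihood ratios L(Y_i)
    estimates = weights * in_A(y)                # L(Y_i) * 1[Y_i in A]
    return estimates.mean(), estimates.std(ddof=1) / math.sqrt(n)

def normal_pdf(x, mean):
    """Density of a normal distribution with the given mean and unit variance."""
    return np.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

# Hypothetical test problem: X ~ N(0, 1), proposal Y ~ N(4, 1), A = {x > 4}.
rng = np.random.default_rng(0)
estimate, std_error = importance_sampling_estimate(
    sample_proposal=lambda rng, n: rng.normal(4.0, 1.0, n),
    target_pdf=lambda y: normal_pdf(y, 0.0),
    proposal_pdf=lambda y: normal_pdf(y, 4.0),
    in_A=lambda y: y > 4.0,
    n=200_000,
    rng=rng,
)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))    # Pr(X > 4) for a standard normal
print(f"estimate = {estimate:.3e} +/- {std_error:.1e}, exact = {exact:.3e}")
```

Passing the sampler, the two densities, and the membership test as functions keeps the routine independent of any particular pair of distributions, at the cost of requiring the caller to make sure the sampler and `proposal_pdf` describe the same distribution.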

Now let $X$ be an exponential random variable with mean $3$, and consider the event $B = \{X > 30\}$. Suppose we wish to estimate $\Pr(B)$ by importance sampling, using as our twisted distribution an exponential with mean $\lambda^{-1}$.

We first calculate the exact probability of the event $B$. The probability density function (PDF) of an exponential distribution with mean $3$ is
\begin{equation}
    \pi(x) = \frac{1}{3} e^{-x/3}.
\end{equation}
The probability of $B$ is calculated by integrating the PDF,
\begin{equation}
    \Pr(B) = \int_{30}^\infty \frac{1}{3} e^{-x/3} \,dx = [-e^{-x/3}]_{30}^\infty = e^{-10} \approx 4.54 \cdot 10^{-5}.
\end{equation}
This is a very small probability, which makes it a good candidate for estimation using importance sampling, as standard Monte Carlo methods would require an extremely large number of samples to observe the event.
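
To put a number on "extremely large", the sketch below works out how many samples standard Monte Carlo would need. It assumes a target relative error of $1\%$ (an illustrative choice) and uses the fact that each indicator $\mathbb{1}[X_i > 30]$ is a Bernoulli variable with variance $p(1 - p)$, where $p = \Pr(B)$.

```python
import math

p = math.exp(-30.0 / 3.0)        # exact Pr(X > 30) for an exponential with mean 3
target_rel_error = 0.01          # illustrative target, an assumption

# Standard Monte Carlo averages indicators 1[X_i > 30], each with variance p(1 - p),
# so the relative error of the n-sample mean is sqrt((1 - p) / (n p)).
naive_n = (1.0 - p) / (p * target_rel_error ** 2)
expected_hits = 1_000_000 * p    # expected number of samples in B out of 10^6

print(f"Pr(B) = {p:.4e}")
print(f"standard Monte Carlo samples for 1% relative error: {naive_n:,.0f}")
print(f"expected hits in 10^6 standard Monte Carlo samples: {expected_hits:.1f}")
```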

In [31]:
import numpy as np

def importance_sampling_estimator(lambda_val, num_samples=1000000):
    '''
    Estimates the probability P(X > 30) for an exponential random variable X
    with mean 3, using importance sampling.
    Args:
        lambda_val: The rate parameter for the twisted (proposal) exponential distribution.
        num_samples: The number of samples to generate for the estimation.
    Returns:
        A tuple containing the estimated probability and the variance of the estimates.
    '''
    # Generate uniform random variables to be transformed into exponential samples.
    uniform_samples = np.random.uniform(0, 1, num_samples)
    # Transform uniform samples to exponential samples.
    # Use 1 - U, which lies in (0, 1], so the logarithm is never evaluated at 0.
    proposal_samples = -np.log(1.0 - uniform_samples) / lambda_val

    # Identify which samples fall into the event of interest (B = {Y > 30}).
    samples_in_event_B = proposal_samples[proposal_samples > 30]
    if len(samples_in_event_B) == 0:
        return 0.0, 0.0

    # Calculate the likelihood ratio L(y) for the samples in B.
    # L(y) = (1 / (3 * lambda)) * exp(y * (lambda - 1/3))
    likelihood_ratios = (1 / (3 * lambda_val)) * np.exp(samples_in_event_B * (lambda_val - 1/3.0))
    estimated_prob = np.sum(likelihood_ratios) / num_samples

    # Calculate the variance of the estimates.
    squared_estimates = likelihood_ratios**2
    variance = (np.sum(squared_estimates) / num_samples) - (estimated_prob**2)
    return estimated_prob, variance


original_lambda = 1/3.0
true_probability = np.exp(-10)
lambda_values_to_test = [1/2, 1/3, 1/10]

print(f"True Probability P(X > 30) = e^(-10) ≈ {true_probability:.10f}")
print("-" * 70)
print(f"{'Lambda (λ)':<15} | {'Mean (1/λ)':<12} | {'Estimated P(B)':<20} | {'Variance':<15}")
print("-" * 70)

for l_val in lambda_values_to_test:
    mean_val = 1/l_val
    est_prob, var = importance_sampling_estimator(l_val)
    print(f"{l_val:<15.4f} | {mean_val:<12.2f} | {est_prob:<20.10f} | {var:<15.10f}")

True Probability P(X > 30) = e^(-10) ≈ 0.0000453999
----------------------------------------------------------------------
Lambda (λ)      | Mean (1/λ)   | Estimated P(B)       | Variance       
----------------------------------------------------------------------
0.5000          | 2.00         | 0.0000000000         | 0.0000000000   
0.3333          | 3.00         | 0.0000530000         | 0.0000529972   
0.1000          | 10.00        | 0.0000457811         | 0.0000000797   


*   $\lambda = 1/2$: Here, the proposal distribution has a mean of $2$ which is smaller than the original mean of $3$. Consequently, the samples are concentrated far away from the region of interest $Y > 30$. In a typical run, it is highly probable that zero samples will be greater than $30$, leading to an estimated probability of $0$. This is a very poor choice for $\lambda$.

*   $\lambda = 1/3$: This sets the proposal distribution equal to the original distribution, so $L \equiv 1$ and the method reduces to standard Monte Carlo simulation. Because the event $\{X > 30\}$ is so rare, only about $45$ of the $10^6$ samples are expected to fall into this region. The resulting estimate has the right order of magnitude but is noisy: the run above gives $5.30 \times 10^{-5}$ against the true value of $4.54 \times 10^{-5}$, which demonstrates the inefficiency of standard Monte Carlo for this problem.

*   $\lambda = 1/10$: The proposal distribution now has a mean of $10$. This is larger than the original mean, which means we are generating more samples in the tail of the distribution, closer to our region of interest. This choice of $\lambda$ produces an estimate that is very close to the true value of $e^{-10}$. The variance is small, but non-zero, indicating a good estimator.
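
The three cases can be checked without simulation. Under an exponential proposal with rate $\lambda$ we have $\Pr(Y > 30) = e^{-30\lambda}$, so the expected number of samples landing in the region of interest follows directly. The short sketch below tabulates this for the three values of $\lambda$; the variable names are illustrative.

```python
import math

num_samples = 1_000_000  # matches the default sample size used in the run above

# Expected number of proposal samples with Y > 30, keyed by the rate lambda.
expected_hits = {
    lam: num_samples * math.exp(-30.0 * lam)   # n * Pr(Y > 30) for Y ~ Exp(rate=lam)
    for lam in (1 / 2, 1 / 3, 1 / 10)
}

for lam, hits in expected_hits.items():
    print(f"lambda = {lam:.4f}: Pr(Y > 30) = {hits / num_samples:.3e}, expected hits = {hits:,.1f}")
```

With $\lambda = 1/2$ fewer than one hit is expected, which is why that run returns an estimate of exactly zero, while $\lambda = 1/10$ places tens of thousands of samples in the tail.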

---

To determine how many samples a simulation requires, we calculate the relative error, which measures the estimator's standard deviation as a fraction of the quantity being estimated.

The standard deviation of our mean estimate is called the standard error, calculated as:
\begin{equation}
    \mathop{SE} = \frac{\sigma}{\sqrt{n}},
\end{equation}
where $\sigma$ is the standard deviation of a single estimate $\hat{\gamma}_i$ and $n$ is the number of samples. The relative error is
\begin{equation}
    \mathop{RE} = \frac{\mathop{SE}}{|\mu|} = \frac{\sigma}{|\mu|\sqrt{n}},
\end{equation}
where $\mu$ is the true value we are estimating. We can now set a target for the desired relative error and solve for the number of samples $n$ required to achieve it:
\begin{equation}
    n = \left(\frac{\sigma}{|\mu|\,\mathrm{RE}}\right)^2 = \frac{\mathrm{Var}(\hat{\gamma}_i)}{\mu^2 \,\mathrm{RE}^2}.
\end{equation}
Since we don't know the true variance or the true mean beforehand, we can run a smaller pilot simulation to obtain the sample variance $s^2$ and sample mean $\bar{x}$ of the individual estimates. We then use these to project the total number of samples needed:
\begin{equation}
    n \approx \frac{s^2}{\bar{x}^2 \,\mathrm{RE}^2}.
\end{equation}
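
As a worked example of this projection, the sketch below plugs in the estimates and variances printed by the earlier run for $\lambda = 1/10$ and $\lambda = 1/3$, with a target relative error of $1\%$. The helper name `required_samples` is illustrative.

```python
def required_samples(sample_variance, sample_mean, target_rel_error):
    """Project n ~ s^2 / (xbar^2 * RE^2) from pilot-run statistics."""
    return sample_variance / (sample_mean ** 2 * target_rel_error ** 2)

# Variance and estimate reported by the earlier run for each lambda.
n_twisted = required_samples(7.97e-8, 4.57811e-5, 0.01)     # lambda = 1/10
n_standard = required_samples(5.29972e-5, 5.30e-5, 0.01)    # lambda = 1/3

print(f"lambda = 1/10: n ~ {n_twisted:,.0f}")
print(f"lambda = 1/3:  n ~ {n_standard:,.0f}")
```

The projection for $\lambda = 1/10$ is a few hundred thousand samples, whereas $\lambda = 1/3$ (standard Monte Carlo) needs on the order of $10^8$.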

If the proposal mean $1/\lambda$ is too small, then the proposal has the same problem as the original distribution: we will very rarely generate a sample greater than $30$, making the simulation extremely inefficient. The variance is small in absolute terms but large relative to $\gamma^2$, and the estimate itself is often zero, so we need an enormous number of samples to get a stable result.

If the proposal mean $1/\lambda$ is too large, then a large share of our samples fall in the region $Y > 30$, but they are spread over a very wide range of values. The likelihood ratio, which must correct for the fact that we are sampling from a different distribution, can then vary over many orders of magnitude between samples, which leads to high variance in the estimates. A high variance means we need more samples to get a reliable average.

In [32]:
import numpy as np

def estimate_required_samples(lambda_val, pilot_samples=10000, target_rel_error=0.01):
    '''
    Estimates the number of samples required to achieve a target relative error
    for a given lambda.
    Args:
        lambda_val: The rate parameter for the proposal distribution.
        pilot_samples: The number of samples for the initial pilot run.
        target_rel_error: The desired relative error for the final estimate.
    Returns:
        The estimated number of samples required.
    '''
    uniform_samples = np.random.uniform(0, 1, pilot_samples)
    # Use 1 - U, which lies in (0, 1], so the logarithm is never evaluated at 0.
    proposal_samples = -np.log(1.0 - uniform_samples) / lambda_val
    all_estimates = np.zeros(pilot_samples)

    in_event_indices = np.where(proposal_samples > 30)[0]
    samples_in_event_B = proposal_samples[in_event_indices]
    if len(samples_in_event_B) == 0:
        return float('inf')

    likelihood_ratios = (1 / (3 * lambda_val)) * np.exp(samples_in_event_B * (lambda_val - 1/3.0))
    all_estimates[in_event_indices] = likelihood_ratios
    sample_mean = np.mean(all_estimates)
    sample_variance = np.var(all_estimates)
    if sample_mean < 1e-15:
        return float('inf')

    required_n = sample_variance / (sample_mean**2 * target_rel_error**2)
    return required_n

true_probability = np.exp(-10)
lambda_values = 1.0 / np.linspace(5, 50, 46)

print(f"Estimating required sample size for a 1% relative error.")
print("-" * 65)
print(f"{'Lambda (λ)':<12} | {'Mean (1/λ)':<12} | {'Required Samples (n)':<25}")
print("-" * 65)

best_lambda = -1
min_samples = float('inf')

for l_val in lambda_values:
    required_n = estimate_required_samples(l_val)
    if required_n < min_samples:
        min_samples = required_n
        best_lambda = l_val

    print(f"{l_val:<12.4f} | {1/l_val:<12.2f} | {required_n:<25,.0f}")

print("-" * 65)
print(f"Optimal λ found: {best_lambda:.4f} (corresponding mean ≈ {1/best_lambda:.2f})")
print(f"Minimum estimated samples required: {min_samples:,.0f}")

Estimating required sample size for a 1% relative error.
-----------------------------------------------------------------
Lambda (λ)   | Mean (1/λ)   | Required Samples (n)     
-----------------------------------------------------------------
0.2000       | 5.00         | 3,884,181                
0.1667       | 6.00         | 2,424,512                
0.1429       | 7.00         | 1,063,282                
0.1250       | 8.00         | 637,474                  
0.1111       | 9.00         | 474,211                  
0.1000       | 10.00        | 383,743                  
0.0909       | 11.00        | 299,042                  
0.0833       | 12.00        | 260,515                  
0.0769       | 13.00        | 246,905                  
0.0714       | 14.00        | 214,731                  
0.0667       | 15.00        | 200,165                  
0.0625       | 16.00        | 189,767                  
0.0588       | 17.00        | 170,321                  
0.0556       | 18.00       

A simulation or its resulting estimator is considered useless if its variance is infinite. With infinite variance the standard error $\sigma/\sqrt{n}$ is undefined and the central limit theorem no longer applies. The sample mean still converges to $\gamma$ by the law of large numbers, but there is no reliable way to quantify its error or to project the sample size required.

The variance of the importance sampling estimator $\hat{\gamma}$ is given by
\begin{equation}
    Var(\hat{\gamma}) = E[\hat{\gamma}^2] - E[\hat{\gamma}]^2.
\end{equation}
Since we already know that $E[\hat{\gamma}] = \gamma = e^{-10} < \infty$, the variance is infinite if and only if the second moment, $E[\hat{\gamma}^2]$, is infinite. This is given by
\begin{align}
    E[\hat{\gamma}^2]
    &= \int_0^\infty \hat{\gamma}(y)^2 \pi'(y) \,dy \\
    &= \int_0^\infty \left(\frac{\pi(y)}{\pi'(y)} \mathbb{1}[y > 30]\right)^2 \pi'(y) \,dy \\
    &= \int_{30}^\infty \frac{\pi(y)^2}{\pi'(y)} \,dy.
\end{align}
Now, substitute the formulas for the probability density functions $\pi(y) = e^{-y/3}/3$ and $\pi'(y) = \lambda e^{-\lambda y}$ to get
\begin{equation}
    E[\hat{\gamma}^2] = \frac{1}{9\lambda} \int_{30}^\infty e^{y(\lambda - 2/3)} \,dy.
\end{equation}
This integral converges if and only if the exponent is negative, that is, $\lambda < 2/3$. The variance is therefore finite exactly when $0 < \lambda < 2/3$.

The optimal value of $\lambda$ is the one that minimises the variance of the estimator or, equivalently, the second moment, since $E[\hat{\gamma}] = \gamma$ does not depend on $\lambda$. Evaluating the above expression for $\lambda < 2/3$, we obtain
\begin{equation}
    E[\hat{\gamma}^2] = \frac{1}{9\lambda} \left[ \frac{1}{\lambda - 2/3} e^{y(\lambda - 2/3)} \right]_{30}^\infty = \frac{1}{9\lambda(2/3 - \lambda)} e^{30(\lambda - 2/3)}.
\end{equation}
Let $f(\lambda) = E[\hat{\gamma}^2]$. Minimising $f(\lambda)$ is equivalent to minimising
\begin{equation}
    \log(f(\lambda)) = \log(1) - \log(9) - \log(\lambda) - \log(2/3 - \lambda) + 30(\lambda - 2/3).
\end{equation}
Now, differentiate with respect to $\lambda$
\begin{equation}
    \frac{d}{d\lambda} \log(f(\lambda)) = -\frac{1}{\lambda} + \frac{1}{2/3 - \lambda} + 30
\end{equation}
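Setting this derivative to zero and multiplying through by $\lambda(2/3 - \lambda)$ gives the quadratic $45\lambda^2 - 33\lambda + 1 = 0$, with solutions $\lambda = (33 \pm \sqrt{909}) / 90$. Only the smaller root lies in the finite-variance range $0 < \lambda < 2/3$, and since the second derivative $1/\lambda^2 + 1/(2/3 - \lambda)^2$ is positive on that range, this root is the minimiser. The optimal rate is therefore $\lambda = (33 - \sqrt{909}) / 90 \approx 0.0317$, which corresponds to a proposal mean of about $31.6$.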
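
As a numerical sanity check on this result, the sketch below evaluates the closed-form second moment on a grid of $\lambda$ values and compares the grid minimiser with the analytic root. The grid bounds and resolution are arbitrary choices, and the function name `second_moment` is illustrative. It also reports the sample size that the relative error formula projects at the optimum for a $1\%$ target.

```python
import math
import numpy as np

def second_moment(lam):
    """Closed-form E[gamma_hat^2] for an exponential proposal with rate 0 < lam < 2/3."""
    return np.exp(30.0 * (lam - 2.0 / 3.0)) / (9.0 * lam * (2.0 / 3.0 - lam))

gamma = math.exp(-10.0)                              # exact Pr(X > 30)
lam_closed_form = (33.0 - math.sqrt(909.0)) / 90.0   # analytic minimiser

# Grid search strictly inside (0, 2/3), with spacing 1e-5.
grid = np.linspace(0.001, 0.666, 66_501)
lam_grid = grid[np.argmin(second_moment(grid))]

# Per-sample variance at the optimum and the projected n for a 1% relative error.
variance_at_optimum = second_moment(lam_closed_form) - gamma ** 2
n_for_one_percent = variance_at_optimum / (gamma ** 2 * 0.01 ** 2)

print(f"analytic minimiser: {lam_closed_form:.5f}")
print(f"grid minimiser:     {lam_grid:.5f}")
print(f"projected n at the optimum for 1% relative error: {n_for_one_percent:,.0f}")
```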