# Assignment 2: Estimation Theory

*Author:* Thomas Adler

*Copyright statement:* This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

## Exercise 1: Fisher Information of the Likelihood Function

Suppose we draw $n$ i.i.d. samples $\{x_1, \ldots, x_n \}$ from a random variable $X$ with pmf $f(x; \theta)$, where $\theta$ is the parameter. The likelihood function is
\begin{align*}
\mathcal{L}(\{x_1, \ldots, x_n \}; \theta) &= \prod^n_{i=1} f(x_i; \theta) . \\
\end{align*}

Assuming you know the Fisher Information $I_F(\theta)$ of one sample under $f(x; \theta)$, calculate the Fisher Information $I_F^{\mathcal L}(\theta)$ of $n$ iid samples, i.e., of the likelihood function. 
Assume that all regularity conditions are met for $f$ and $\mathcal{L}$, i.e., you can use the identities
\begin{equation*}
I_F^{\mathcal L}(\theta) = - \mathbb{E}_X \Big[ \frac{\partial^2}{\partial \theta^2} \ln \mathcal L(\{x_1, \ldots, x_n \}; \theta) \Big] \quad \text{and} \quad I_F(\theta) = - \mathbb{E}_X \Big[ \frac{\partial^2}{\partial \theta^2} \ln f(x; \theta) \Big].
\end{equation*}
Interpret the result!

########## YOUR SOLUTION HERE ##########

##### Step 1: Compute the log-likelihood:

$$
\ln \mathcal{L}(\{x_1, \ldots, x_n \}; \theta) = \ln \left( \prod^n_{i=1} f(x_i; \theta) \right) = \sum^n_{i=1} \ln f(x_i; \theta).
$$

##### Step 2: Differentiate the log-likelihood twice w.r.t. $\theta$

$$
\frac{\partial^2}{\partial \theta^2} \ln \mathcal L(\{x_1, \ldots, x_n \}; \theta) = \sum^n_{i=1} \frac{\partial^2}{\partial \theta^2} \ln f(x_i; \theta).
$$

##### Step 3: Compute Fisher Information
$$
I_F^{\mathcal L}(\theta) = - \mathbb{E}_X \left[ \sum^n_{i=1} \frac{\partial^2}{\partial \theta^2} \ln f(x_i; \theta) \right].
$$

Since $I_F(\theta) = - \mathbb{E}_X \left[ \frac{\partial^2}{\partial \theta^2} \ln f(x; \theta) \right]$ is the Fisher Information of one sample, linearity of expectation lets us move the expectation inside the sum. Because the samples are identically distributed, each of the $n$ terms equals $I_F(\theta)$, so the expression simplifies to:

$$
I_F^{\mathcal L}(\theta) = n \cdot I_F(\theta).
$$

##### Comments:
The result shows that the information about the parameter $\theta$ grows linearly with the number of iid samples $n$. In practical terms, a larger sample allows a more precise estimate: by the Cramér-Rao bound, the variance of any unbiased estimator of $\theta$ is at least $\frac{1}{n I_F(\theta)}$, which shrinks as $1/n$.
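As a quick numerical check of this additivity, the sketch below evaluates $-\mathbb{E}\big[\frac{\partial^2}{\partial p^2} \ln \mathcal{L}\big]$ exactly for $n$ iid Bernoulli samples. The Bernoulli model is only an illustrative choice (it is not part of the exercise), and the function name `fisher_information_n_samples` is hypothetical. The log-likelihood depends on the data only through the number of successes $k$, so the expectation is a finite sum over $k = 0, \dots, n$.

```python
from math import comb


def fisher_information_n_samples(n: int, p: float) -> float:
    """Exact -E[d^2/dp^2 ln L] for n iid Bernoulli(p) samples."""
    total = 0.0
    for k in range(n + 1):
        # Probability of observing k successes among n samples.
        prob_k = comb(n, k) * p**k * (1 - p) ** (n - k)
        # Second derivative of ln L = k ln(p) + (n - k) ln(1 - p).
        second_derivative = -k / p**2 - (n - k) / (1 - p) ** 2
        total += prob_k * second_derivative
    return -total


p = 0.3
single_sample = fisher_information_n_samples(1, p)
for n in (1, 5, 20):
    # The second and third columns should agree up to rounding.
    print(n, fisher_information_n_samples(n, p), n * single_sample)
```

For every $n$ the directly computed value matches $n \cdot I_F(p)$ up to floating-point rounding.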

## Exercise 2: Fisher Information of the Bernoulli Distribution

Consider a Bernoulli distributed random variable $X$ with probability mass function (pmf)
\begin{align*}
f(x ; p) &= p^x(1-p)^{1-x} , \\
\end{align*}
where $0 \leq p \leq 1$ is the parameter and the support is $x \in \{0,1\}$. 
Calculate the Fisher information. 
Assume $f$ fulfills all necessary regularity conditions so that you may use the form
\begin{align*}
I_F(p) &= -\mathbb{E}_X \Big[ \frac{d^2}{dp^2} \ln f(x ; p) \Big]  \ .
\end{align*}

########## YOUR SOLUTION HERE ##########

To calculate the Fisher Information $I_F(p)$ for a Bernoulli distributed random variable with pmf $f(x; p) = p^x(1-p)^{1-x}$, where $x \in \{0,1\}$ and $0 \leq p \leq 1$, we start by taking the logarithm of the pmf, then find its second derivative with respect to $p$, and finally compute its expected value as required.

##### Step 1: Compute the logarithm of the pmf:
$$
\ln f(x; p) = x\ln(p) + (1-x)\ln(1-p).
$$

##### Step 2: First derivative w.r.t. $p$
$$
\frac{d}{dp} \ln f(x ; p) = \frac{x}{p} - \frac{1-x}{1-p}.
$$

##### Step 3: Second derivative w.r.t. $p$
$$
\frac{d^2}{dp^2} \ln f(x ; p) = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2}.
$$

##### Step 4: Compute Fisher Information
Fisher Information is defined as the negative expected value of this second derivative, so:
$$
I_F(p) = -\mathbb{E}_X \left[ -\frac{x}{p^2} - \frac{1-x}{(1-p)^2} \right] = \mathbb{E}_X \left[ \frac{x}{p^2} + \frac{1-x}{(1-p)^2} \right].
$$

Because $X$ is Bernoulli distributed, $\mathbb{E}_X[x] = p$ and $\mathbb{E}_X[1-x] = 1-p$. By linearity of expectation:
$$
I_F(p) = p \cdot \frac{1}{p^2} + (1-p) \cdot \frac{1}{(1-p)^2} = \frac{1}{p} + \frac{1}{1-p}.
$$

Upon simplifying this expression, we get:
$$
I_F(p) = \frac{1}{p(1-p)}.
$$

##### Comments
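The Fisher Information $I_F(p) = \frac{1}{p(1-p)}$ attains its minimum, $I_F = 4$, at $p = 0.5$ and grows without bound as $p \to 0$ or $p \to 1$. It is the reciprocal of the Bernoulli variance $p(1-p)$: when the outcome is most uncertain ($p = 0.5$), a single observation is noisiest and tells us the least about $p$, whereas near the boundaries the outcomes are almost deterministic and $p$ can be pinned down very precisely.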
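To see how the result behaves across the parameter range, the closed form can be compared against a direct evaluation of the defining expectation on a small grid of $p$ values. This is only a sketch; the function name `bernoulli_fisher_information` and the grid are illustrative choices.

```python
def bernoulli_fisher_information(p: float) -> float:
    """-E[d^2/dp^2 ln f(x; p)], summed exactly over the support x in {0, 1}."""
    info = 0.0
    for x in (0, 1):
        pmf = p**x * (1 - p) ** (1 - x)
        second_derivative = -x / p**2 - (1 - x) / (1 - p) ** 2
        info -= pmf * second_derivative
    return info


for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    direct = bernoulli_fisher_information(p)
    closed_form = 1 / (p * (1 - p))
    print(f"p={p:.2f}  direct={direct:.3f}  closed form={closed_form:.3f}")
```

The printed values are smallest at $p = 0.5$ and increase toward both ends of the interval.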

## Exercise 3: Fisher Information of the Poisson Distribution

Consider a Poisson distributed random variable $X$ with pmf
\begin{align*}
p(x ; \lambda) &= \frac{\lambda^x}{x!} e^{-\lambda} \ , \\ 
\end{align*}
where $\lambda$ is the parameter and the support is $x \in \mathbb{N} \cup \{0\}$.
Calculate the Fisher information. Again, assume that all necessary regularity conditions hold, i.e. you can use the form
\begin{align*}
I_F(\lambda) &= -\mathbb{E}_X \Big[ \frac{d^2}{d \lambda^2} \ln p(x ; \lambda) \Big] \ .
\end{align*}

########## YOUR SOLUTION HERE ##########

To calculate the Fisher Information $I_F(\lambda)$ for a Poisson distributed random variable with the probability mass function (pmf) $p(x ; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$, where $\lambda > 0$ is the parameter and $x \in \mathbb{N} \cup \{0\}$, we start by taking the natural logarithm of the pmf, then find its second derivative with respect to $\lambda$, and finally compute its expected value as required.

##### Step 1: Compute the Logarithm of the pmf
$$
\ln p(x ; \lambda) = x\ln(\lambda) - \lambda - \ln(x!).
$$

##### Step 2: First Derivative with Respect to $\lambda$
$$
\frac{d}{d\lambda} \ln p(x ; \lambda) = \frac{x}{\lambda} - 1.
$$

##### Step 3: Second Derivative with Respect to $\lambda$
$$
\frac{d^2}{d\lambda^2} \ln p(x ; \lambda) = -\frac{x}{\lambda^2}.
$$

##### Step 4: Compute Fisher Information
Given the Fisher Information formula,
$$
I_F(\lambda) = -\mathbb{E}_X \left[ \frac{d^2}{d\lambda^2} \ln p(x ; \lambda) \right] = -\mathbb{E}_X \left[ -\frac{x}{\lambda^2} \right] = \mathbb{E}_X \left[ \frac{x}{\lambda^2} \right].
$$

To compute this expectation, we recall that for a Poisson distribution, the expected value $\mathbb{E}_X[x]$ is $\lambda$. Therefore,
$$
I_F(\lambda) = \frac{\mathbb{E}_X[x]}{\lambda^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}.
$$

##### Comments: 
The Fisher information decreases as $\lambda$ increases: since the variance of a Poisson variable equals $\lambda$, counts at a higher rate are noisier, and each observation carries less information about $\lambda$.
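The result $I_F(\lambda) = 1/\lambda$ can be checked numerically by summing the defining expectation over the support. The sketch below truncates the infinite sum at `x_max`, an assumed cut-off that is far in the tail for the rates used here; the function name `poisson_fisher_information` is hypothetical.

```python
from math import exp


def poisson_fisher_information(lam: float, x_max: int = 200) -> float:
    """Approximate -E[d^2/dlam^2 ln p(x; lam)] by summing over x = 0..x_max."""
    pmf = exp(-lam)  # p(0; lam)
    info = 0.0
    for x in range(x_max + 1):
        second_derivative = -x / lam**2
        info -= pmf * second_derivative
        pmf *= lam / (x + 1)  # recurrence: p(x + 1) = p(x) * lam / (x + 1)
    return info


for lam in (0.5, 2.0, 10.0):
    # The two printed numbers should agree up to rounding.
    print(lam, poisson_fisher_information(lam), 1 / lam)
```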

## Exercise 4: Fisher Information of the Mean of the Normal Distribution

Consider a normally distributed random variable $X$ with pdf
\begin{align*}
p(x ; \mu, \sigma^2) &= \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \\ 
\end{align*}
where $\mu$ and $\sigma^2$ are the parameters and the support is $x \in \mathbb R$.
Calculate the Fisher information for $\mu$. Again, assume that all necessary regularity conditions hold, i.e. you can use the form
\begin{align*}
I_F(\mu) &= -\mathbb{E}_X \Big[ \frac{d^2}{d \mu^2} \ln p(x ; \mu, \sigma^2) \Big] \ .
\end{align*}

########## YOUR SOLUTION HERE ##########

To calculate the Fisher Information $I_F(\mu)$ for the mean $\mu$ of a normally distributed random variable with the probability density function (pdf) $p(x ; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, we start by taking the logarithm of the pdf, then find its second derivative with respect to $\mu$, and finally compute its expected value as required.

##### Step 1: Compute the Logarithm of the pdf
$$
\ln p(x ; \mu, \sigma^2) = -\ln(\sigma \sqrt{2 \pi}) - \frac{(x-\mu)^2}{2\sigma^2}.
$$

##### Step 2: First Derivative w.r.t. $\mu$
$$
\frac{d}{d\mu} \ln p(x ; \mu, \sigma^2) = \frac{(x-\mu)}{\sigma^2}.
$$

##### Step 3: Second Derivative w.r.t. $\mu$
$$
\frac{d^2}{d\mu^2} \ln p(x ; \mu, \sigma^2) = -\frac{1}{\sigma^2}.
$$

Note that this second derivative does not depend on $x$.

##### Step 4: Compute Fisher Information
Because the second derivative is constant, the expectation in the Fisher Information formula is trivial:
$$
I_F(\mu) = -\mathbb{E}_X \left[ -\frac{1}{\sigma^2} \right] = \frac{1}{\sigma^2}.
$$

##### Comments
The Fisher information increases as the variance $\sigma^2$ decreases. This means that data concentrated tightly around the mean provide more information about $\mu$.
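As a cross-check, the same value can be obtained from the squared-score form $\mathbb{E}_X\big[(\frac{d}{d\mu} \ln p)^2\big]$, which equals the second-derivative form under the same regularity conditions. The sketch below approximates that expectation with a midpoint rule over $\mu \pm 10\sigma$; the integration range, the step count and the function names are assumptions made for illustration.

```python
from math import exp, pi, sqrt


def normal_pdf(x: float, mu: float, sigma2: float) -> float:
    return exp(-((x - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)


def fisher_information_mean(mu: float, sigma2: float, steps: int = 20000) -> float:
    """Midpoint-rule approximation of E[(d/dmu ln p)^2] over mu +/- 10 sigma."""
    sigma = sqrt(sigma2)
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        score = (x - mu) / sigma2  # first derivative of ln p w.r.t. mu
        total += normal_pdf(x, mu, sigma2) * score**2 * dx
    return total


for sigma2 in (0.5, 1.0, 4.0):
    # The two printed numbers should agree closely.
    print(sigma2, fisher_information_mean(1.0, sigma2), 1 / sigma2)
```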

## Exercise 5: Fisher Information of the Variance of the Normal Distribution

Consider a normally distributed random variable $X$ with pdf
\begin{align*}
p(x ; \mu, \sigma^2) &= \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \\ 
\end{align*}
where $\mu$ and $\sigma^2$ are the parameters and the support is $x \in \mathbb R$.
Calculate the Fisher information for $\sigma^2$. Again, assume that all necessary regularity conditions hold, i.e. you can use the form
\begin{align*}
I_F(\sigma^2) &= -\mathbb{E}_X \Big[ \frac{d^2}{d (\sigma^2)^2} \ln p(x ; \mu, \sigma^2) \Big] \ .
\end{align*}

########## YOUR SOLUTION HERE ##########

##### Step 1: Compute the Logarithm of the pdf

$$
\ln p(x ; \mu, \sigma^2) = -\ln(\sigma \sqrt{2 \pi}) - \frac{(x-\mu)^2}{2\sigma^2}.
$$

##### Step 2: First Derivative w.r.t. $c = \sigma^2$

To simplify the calculation, we introduce the substitution $c=\sigma^2$, so that $\sigma=\sqrt{c}$. The derivative is then:

$$
\frac{d}{d c} \ln p(x ; \mu, c) = \frac{d}{d c} \left [ -\ln(\sqrt{2 \pi c}) - \frac{(x-\mu)^2}{2c} \right ] = \frac{d}{d c} \left [ -\frac{1}{2}\ln(2 \pi c) - \frac{(x-\mu)^2}{2c} \right ] = -\frac{1}{2c} + \frac{(x-\mu)^2}{2c^2}
$$

##### Step 3: Second Derivative w.r.t. $c = \sigma^2$

$$
\frac{d^2}{d c^2} \ln p(x ; \mu, c) = \frac{1}{2c^2} - \frac{(x-\mu)^2}{c^3}
$$

Substituting $c = \sigma^2$ back, we obtain:

$$
\frac{d^2}{d (\sigma^2)^2} \ln p(x ; \mu, \sigma^2) = \frac{1}{2\sigma^4} - \frac{(x-\mu)^2}{\sigma^6}
$$

##### Step 4: Compute Fisher Information
Given the Fisher Information formula,
$$
I_F(\sigma^2) = -\mathbb{E}_X \Big[ \frac{d^2}{d (\sigma^2)^2} \ln p(x ; \mu, \sigma^2) \Big]
$$

To compute this expectation, we recall that the expected value $\mathbb{E}_X[(x-\mu)^2] = \sigma^2$. Therefore,
$$
I_F(\sigma^2) = -\frac{1}{2\sigma^4} + \frac{\sigma^2}{\sigma^6} = \frac{1}{2\sigma^4}.
$$
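The value $\frac{1}{2\sigma^4}$ can be verified without relying on the analytic derivatives. The sketch below approximates the second derivative of the log-pdf with respect to $c = \sigma^2$ by a central finite difference and integrates it against the pdf with a midpoint rule. The step sizes, the integration range of $\mu \pm 10\sigma$ and the function names are assumptions chosen for illustration.

```python
from math import exp, log, pi, sqrt


def log_pdf(x: float, mu: float, c: float) -> float:
    """Log-density of N(mu, c), with c standing for sigma^2."""
    return -0.5 * log(2 * pi * c) - (x - mu) ** 2 / (2 * c)


def fisher_information_variance(
    mu: float, c: float, steps: int = 20000, rel_step: float = 1e-3
) -> float:
    """Approximate -E[d^2/dc^2 ln p(x; mu, c)] numerically."""
    sigma = sqrt(c)
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    dx = (hi - lo) / steps
    h = rel_step * c  # finite-difference step in c
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        # Central second difference of the log-pdf in the variance parameter.
        second = (
            log_pdf(x, mu, c + h) - 2 * log_pdf(x, mu, c) + log_pdf(x, mu, c - h)
        ) / h**2
        total -= exp(log_pdf(x, mu, c)) * second * dx
    return total


for c in (0.5, 1.0, 2.0):
    # The two printed numbers should agree to several digits.
    print(c, fisher_information_variance(0.0, c), 1 / (2 * c**2))
```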

## Exercise 6: Variance of Arithmetic Mean

Given a sequence of iid random variables $X_1, \dots, X_n$, their arithmetic mean is given by
\begin{align*}
    \bar X = \frac1n \sum_{i=1}^n X_i. 
\end{align*}
Calculate the variance of that estimator. 

########## YOUR SOLUTION HERE ##########

From Forum:
For Ex 6, assume that $X_1, \dots, X_n$ have variance $\sigma^2$.

To calculate the variance of the arithmetic mean $\bar{X}$ of $n$ independent and identically distributed (iid) random variables $X_1, \dots, X_n$, where each $X_i$ has a variance of $\sigma^2$, we can use the properties of variance for sums of random variables and the scaling property. 

##### Step 1: Variance of a Sum
For iid random variables, the variance of their sum is the sum of their variances because the covariance between any two distinct random variables is zero. Thus, for $X_1, \dots, X_n$, we have:
$$
\text{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i) = n\sigma^2.
$$

##### Step 2: Variance of a Scaled Random Variable
The variance of a scaled random variable, where the scaling factor is $\frac{1}{n}$, is the square of the scaling factor times the variance of the original variable. Applying this to the sum of $X_i$, we get:
$$
\text{Var}(\bar{X}) = \text{Var}\left(\frac{1}{n} \sum_{i=1}^n X_i\right) = \left(\frac{1}{n}\right)^2 \text{Var}\left(\sum_{i=1}^n X_i\right) = \left(\frac{1}{n}\right)^2 n\sigma^2 = \frac{\sigma^2}{n}.
$$

##### Comments
The arithmetic mean of a large number of observations becomes more and more concentrated around the true mean of the distribution, assuming the variance $\sigma^2$ is finite.
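The formula $\text{Var}(\bar X) = \sigma^2 / n$ holds for any iid sequence with finite variance, so it can be checked exactly on a small discrete example. The sketch below assumes a fair six-sided die (an illustrative choice, not part of the exercise), enumerates all $6^n$ outcomes and uses exact rational arithmetic; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Fair six-sided die, chosen only for illustration.
faces = [Fraction(k) for k in range(1, 7)]


def variance(values_with_probs):
    """Exact variance of a discrete distribution given as (value, prob) pairs."""
    mean = sum(v * p for v, p in values_with_probs)
    return sum(p * (v - mean) ** 2 for v, p in values_with_probs)


sigma2 = variance([(f, Fraction(1, 6)) for f in faces])


def variance_of_mean(n: int) -> Fraction:
    """Exact variance of the arithmetic mean of n iid die rolls."""
    prob = Fraction(1, 6**n)
    outcomes = [(sum(rolls) / n, prob) for rolls in product(faces, repeat=n)]
    return variance(outcomes)


for n in (1, 2, 3, 4):
    # Both columns are exact fractions and should be identical.
    print(n, variance_of_mean(n), sigma2 / n)
```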

## Exercise 7: Bias of Variance Estimator

Consider an iid sequence of random variables $X_1, \dots, X_n$ and the estimator
\begin{equation*}
    \hat \sigma^2 = \frac1n \sum_{i=1}^n (X_i - \bar X)^2
\end{equation*}
for the variance $\sigma^2$.
Calculate the bias of this estimator. 
If the estimator is biased, can you correct it? 

########## YOUR SOLUTION HERE ##########

From Forum: For Ex 7, assume that $X_1, \dots, X_n$ have unknown mean $\mu$ and variance $\sigma^2$.

To calculate the bias of the variance estimator $\hat \sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2$, we need to find the expected value of $\hat \sigma^2$ and compare it to the true variance $\sigma^2$. The bias of an estimator is defined as the difference between its expected value and the true parameter value it estimates, thus:
$$
\text{Bias}(\hat \sigma^2) = \mathbb{E}[\hat \sigma^2] - \sigma^2.
$$

Given that $X_1, \dots, X_n$ are iid random variables with mean $\mu$ and variance $\sigma^2$, let's proceed with the calculation.

##### Step 1: Expected Value of the Variance Estimator
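The expected value of $\hat \sigma^2$ follows from the decomposition $\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n (X_i - \mu)^2 - n(\bar X - \mu)^2$. Taking expectations and using $\mathbb{E}[(X_i - \mu)^2] = \sigma^2$ and $\mathbb{E}[(\bar X - \mu)^2] = \text{Var}(\bar X) = \frac{\sigma^2}{n}$ gives $\mathbb{E}\big[\sum_{i=1}^n (X_i - \bar X)^2\big] = n\sigma^2 - \sigma^2 = (n-1)\sigma^2$, so that: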
$$
\mathbb{E}[\hat \sigma^2] = \frac{n-1}{n} \sigma^2.
$$

##### Step 2: Calculate the Bias
Using the expected value of $\hat \sigma^2$, we can calculate its bias as:
$$
\text{Bias}(\hat \sigma^2) = \frac{n-1}{n} \sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}.
$$

##### Bias Correction
The estimator $\hat \sigma^2$ is biased because its expectation is not equal to $\sigma^2$ but $\frac{n-1}{n} \sigma^2$. To correct this bias, we can multiply $\hat \sigma^2$ by $\frac{n}{n-1}$ to make it an unbiased estimator. The corrected (unbiased) estimator for the variance is then:
$$
\hat \sigma^2_{\text{unbiased}} = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2.
$$

This estimator is unbiased because its expected value is $\sigma^2$, as shown by:
$$
\mathbb{E}[\hat \sigma^2_{\text{unbiased}}] = \frac{n}{n-1} \mathbb{E}[\hat \sigma^2] = \frac{n}{n-1} \cdot \frac{n-1}{n} \sigma^2 = \sigma^2.
$$
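Both expectations can be confirmed exactly on a small example. The sketch below assumes fair coin flips with values 0 and 1 (so $\sigma^2 = \frac14$; this distribution is an illustrative choice), enumerates all $2^n$ outcomes and averages both estimators with exact rational arithmetic. The function name `expected_variance_estimates` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_variance_estimates(n: int):
    """Exact (E[biased], E[corrected]) for n fair coin flips with values 0 and 1."""
    prob = Fraction(1, 2**n)
    e_biased = Fraction(0)
    e_corrected = Fraction(0)
    for flips in product((0, 1), repeat=n):
        mean = Fraction(sum(flips), n)
        sum_sq = sum((x - mean) ** 2 for x in flips)
        e_biased += prob * sum_sq / n           # divides by n
        e_corrected += prob * sum_sq / (n - 1)  # divides by n - 1
    return e_biased, e_corrected


sigma2 = Fraction(1, 4)  # true variance of a fair coin flip
for n in (2, 4, 8):
    e_biased, e_corrected = expected_variance_estimates(n)
    # e_biased should equal (n - 1) / n * sigma2, e_corrected should equal sigma2.
    print(n, e_biased, Fraction(n - 1, n) * sigma2, e_corrected)
```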

## Exercise 8: Variance of Variance Estimator

Write a python routine that estimates the variance of the estimator $\hat \sigma^2$ using a sample of $n=10$ standard normally distributed variables over 10000 trials. 
Then do the same for the bias-corrected version of $\hat \sigma^2$. 
Compare both estimates to the Cramér-Rao lower bound (CRLB), then interpret and explain the results. 

In [1]:
########## YOUR SOLUTION HERE ##########

import numpy as np

n_trials = 10000
n = 10

# True variance of a standard normal distribution (equal to 1)
sigma_squared = 1

def simulate_variance_estimators(n, n_trials, sigma_squared):
    var_estimates = np.zeros(n_trials)
    var_estimates_unbiased = np.zeros(n_trials)
    
    for trial in range(n_trials):
        sample = np.random.standard_normal(n)
        sample_mean = np.mean(sample)
        var_estimates[trial] = np.sum((sample - sample_mean) ** 2) / n
        var_estimates_unbiased[trial] = np.sum((sample - sample_mean) ** 2) / (n - 1)
    
    # Calculate the variance of the variance estimators
    var_of_var_estimator = np.var(var_estimates)
    var_of_var_estimator_unbiased = np.var(var_estimates_unbiased)
    
    # Calculate the Cramer-Rao Lower Bound for the variance estimator
    crlb = (2 * sigma_squared**2) / n
    
    return var_of_var_estimator, var_of_var_estimator_unbiased, crlb

var_of_var_estimator, var_of_var_estimator_unbiased, crlb = simulate_variance_estimators(n, n_trials, sigma_squared)

var_of_var_estimator, var_of_var_estimator_unbiased, crlb


(0.18195663348514945, 0.22463781911746847, 0.2)

##### Comments
- The variance of the biased estimator ($\approx 0.182$) is lower than the CRLB ($0.2$). This does not violate the bound, because the CRLB in this form applies only to unbiased estimators, and this estimator is biased.
- The variance of the bias-corrected estimator ($\approx 0.225$) is larger than the CRLB, as it must be for an unbiased estimator. The CRLB is only a lower limit; this estimator does not attain it.
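The simulated values can be compared with closed-form expressions. The sketch below relies on two standard results that are not derived in this notebook and are therefore stated as assumptions: for normal samples, $\sum_i (X_i - \bar X)^2 / \sigma^2$ follows a chi-square distribution with $n-1$ degrees of freedom (variance $2(n-1)$), and for a biased estimator the Cramér-Rao bound is scaled by the squared derivative of its expectation with respect to the parameter.

```python
n = 10
sigma2 = 1.0

# Assumption: for normal samples, sum((x_i - mean)^2) / sigma^2 is chi-square
# distributed with n - 1 degrees of freedom, which has variance 2 * (n - 1).
var_sum_sq = 2 * (n - 1) * sigma2**2

var_biased = var_sum_sq / n**2            # estimator that divides by n
var_unbiased = var_sum_sq / (n - 1) ** 2  # estimator that divides by n - 1

# CRLB for unbiased estimators of sigma^2 from n samples.
crlb_unbiased = 2 * sigma2**2 / n

# CRLB for a biased estimator: (d E[estimator] / d sigma^2)^2 / I_n(sigma^2),
# where E[estimator] = (n - 1) / n * sigma^2 for the biased estimator.
crlb_biased = ((n - 1) / n) ** 2 * crlb_unbiased

print("biased:   variance", var_biased, " bound", crlb_biased)
print("unbiased: variance", var_unbiased, " bound", crlb_unbiased)
```

Under these assumptions the theoretical variances are $0.18$ and $2/9 \approx 0.222$, close to the simulated $0.182$ and $0.225$, and each estimator stays above the bound that applies to it.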

## Exercise 9: Maximum Likelihood Estimator for the Variance of a Gaussian

Derive the maximum likelihood estimator for the variance of $n$ data points $x_1, \dots, x_n$ drawn from a normal distribution. 
Is this estimator biased?
*Hint: use the log-likelihood for your derivation*

########## YOUR SOLUTION HERE ##########

To derive the Maximum Likelihood Estimator (MLE) for the variance $\sigma^2$ of $n$ data points $x_1, \dots, x_n$ drawn from a Normal distribution, we'll start with the probability density function (pdf) of a Normal distribution, move on to the log-likelihood function, differentiate it with respect to $\sigma^2$, set the derivative equal to zero to find the critical points, and solve for $\sigma^2$.

##### Step 1: Normal Distribution PDF
The pdf of a Normal distribution with mean $\mu$ and variance $\sigma^2$ is given by:
$$
p(x ; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.
$$

##### Step 2: Log-Likelihood Function
The log-likelihood function for $n$ data points under this distribution is:
$$
\ln \mathcal{L}(\sigma^2) = \sum_{i=1}^{n} \ln \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \right) = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i-\mu)^2.
$$

##### Step 3: Differentiate Log-Likelihood w.r.t. $\sigma^2$
We differentiate $\ln \mathcal{L}(\sigma^2)$ with respect to $\sigma^2$ and set the derivative to zero to find the maximum:
$$
\frac{d}{d\sigma^2} \ln \mathcal{L}(\sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^{n} (x_i-\mu)^2 = 0.
$$

##### Step 4: Solve for $\sigma^2$
Rearranging the terms to solve for $\sigma^2$:
$$
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i-\mu)^2.
$$

##### Is the Estimator Biased?
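The MLE derived above assumes that the true mean $\mu$ is known; in that case it is unbiased, since $\mathbb{E}[(x_i - \mu)^2] = \sigma^2$. In practice, $\mu$ is usually unknown. Maximizing the likelihood jointly over $\mu$ and $\sigma^2$ gives $\hat\mu = \bar{x}$, and substituting it yields the usual estimator for the variance: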
$$
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2,
$$
where $\bar{x}$ is the sample mean. This estimator is biased: its expected value is $\frac{n-1}{n}\sigma^2$, so it underestimates the variance on average. The unbiased estimator, commonly used in practice, corrects this by dividing by $n-1$ instead of $n$:
$$
\hat{\sigma}^2_{\text{unbiased}} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.
$$

Thus, while the MLE derived here is a direct result from maximizing the likelihood, it is biased when the mean $\mu$ is not known and has to be estimated from the data. The bias arises because the estimation of the mean consumes one degree of freedom, which is not accounted for in the MLE when dividing by $n$.
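The closed-form maximizer from Step 4 can be sanity-checked by evaluating the log-likelihood on a grid of candidate values for $\sigma^2$. The sketch below uses a small made-up sample and treats the mean as known, as in the derivation; the data, the grid and all names are assumptions for illustration only.

```python
from math import log, pi

data = [2.1, -0.4, 1.3, 0.7, 3.0, 1.9]  # made-up sample for illustration
mu = 1.0  # mean assumed known, as in the derivation


def log_likelihood(sigma2: float) -> float:
    """Gaussian log-likelihood of the sample as a function of sigma^2."""
    n = len(data)
    sum_sq = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * log(2 * pi * sigma2) - sum_sq / (2 * sigma2)


# Closed-form MLE: (1/n) * sum((x_i - mu)^2).
closed_form = sum((x - mu) ** 2 for x in data) / len(data)

# Brute-force search over sigma^2 = 0.01, 0.02, ..., 10.00.
grid = [0.01 * k for k in range(1, 1001)]
grid_argmax = max(grid, key=log_likelihood)

print("closed form:", closed_form)
print("grid argmax:", grid_argmax)
```

The grid maximizer should land within one grid step of the closed-form value, since the log-likelihood has a single maximum in $\sigma^2$.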