# 5 Point Estimation
Point estimation, a fundamental task in statistics, provides a single 'best' guess for the unknown parameter $\theta$ by means of a statistic,
$$\hat \theta = g(X_1,X_2,\dotsc,X_n).$$

Since the estimator $\hat \theta$ depends on the samples $X_1,X_2,\dotsc,X_n$, it is itself a random variable. Despite this randomness, we aim to construct an estimator $\hat \theta$ that is close to $\theta$ with high probability.

## Method of Moments Estimation

Let $\mu_k \equiv \mu_k(\theta) =   \mathbb E(X^k)$ denote the $k$-th moment. Denote the sample moment by 
$$M_k = \frac 1n \sum_{i=1}^n X_i^k.$$

The MM estimator (MME) $\hat \theta$ is chosen so that the first $k$ theoretical moments match the corresponding sample moments:

$$\mu_i (\hat \theta) = M_i\quad (i = 1,2,\dotsc,k).$$

For example, take $k = 2$ and estimate the mean and the variance $(\mu,\sigma^2)$. Since the moment equations read
$\hat\mu = M_1$ and $\hat\sigma^2 + \hat \mu^2 = M_2$, the MM estimator is given by 
$$\hat \mu = \overline X\quad\quad\quad \hat \sigma^2 = \frac 1n \sum_{i=1}^n X_i^2 - (\overline X)^2
 = \frac1n\sum_{i=1}^n\left(X_i - \overline X\right)^2.$$
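The two moment equations above can be checked numerically. The following is a minimal sketch in Python with NumPy; the function name `mme_mean_variance` and the toy sample are assumptions made for illustration only.

```python
import numpy as np

def mme_mean_variance(x):
    """Method-of-moments estimates of (mu, sigma^2) from a 1-D sample."""
    x = np.asarray(x, dtype=float)
    m1 = x.mean()            # first sample moment M_1
    m2 = (x ** 2).mean()     # second sample moment M_2
    # Solve mu = M_1 and sigma^2 + mu^2 = M_2 for (mu, sigma^2).
    return m1, m2 - m1 ** 2

# Hypothetical toy sample, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma2_hat = mme_mean_variance(sample)
print(mu_hat, sigma2_hat)
```

The second return value agrees with `np.var(sample)`, whose default `ddof=0` uses the same $\frac 1n$ normalization.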
 
### Sample Variance

Note that the order-2 MM estimator $\hat\sigma^2$ is biased, since
$$\begin{aligned}\mathbb E(\hat \sigma^2)& = \mathbb E\left( \frac 1n \sum_{i=1}^n X_i^2- (\overline X)^2\right) = 
\left({\rm Var}( X)+\left(\mathbb E( X)\right)^2\right)
 -\left({\rm Var}(\overline X)+\left(\mathbb E(\overline X)\right)^2\right)\\ 
 &=  (\mu^2+\sigma^2) - \left(\frac{1}{n}\sigma^2+\mu^2\right)
\\&=\frac{n-1}{n}\sigma^2.
 \end{aligned}$$
An unbiased estimator for $\sigma^2$, called the sample variance, is obtained by rescaling:
$$\hat\sigma'^2 =\frac{n}{n-1}\hat\sigma^2=\frac{1}{n-1}\sum_{i=1}^n \left(X_i - \overline X\right)^2.$$
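The bias factor $\frac{n-1}{n}$ can be observed in a small simulation. The sketch below assumes normally distributed data with $\sigma^2 = 4$ and $n = 5$; these values, the seed, and the variable names are illustrative choices, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, sigma2 = 5, 200_000, 4.0

# Each row is one sample of size n from N(1, sigma2).
samples = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(reps, n))

biased = samples.var(axis=1, ddof=0)     # divides by n (the MM estimator)
unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1 (sample variance)

mean_biased = biased.mean()      # should be near (n - 1) / n * sigma2
mean_unbiased = unbiased.mean()  # should be near sigma2
print(mean_biased, mean_unbiased)
```

Averaged over many repetitions, the first estimator should settle near $\frac{n-1}{n}\sigma^2$ and the second near $\sigma^2$, up to Monte Carlo error.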

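### Asymptotic Normality

Often (under mild conditions), the MME $\hat \theta$ is consistent for $\theta$ and asymptotically normal,
$$\frac{\hat \theta - \theta}{\widehat{\rm SE}(\hat \theta)}\stackrel{d}{\rightarrow} N(0,1)\quad\quad (n\rightarrow \infty),$$
where $\widehat{\rm SE}(\hat \theta)$ is the estimated standard error of $\hat \theta$, which is typically of order $1/\sqrt n$.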

## Maximum Likelihood Estimation

### Likelihood Function

For fixed observed data $x$ and varying parameter $\theta$, the likelihood function $L(\theta) = f_\theta(x)$ is the density (or probability mass) of $x$ under the distribution $F(\theta)$. 

The strategy of maximum likelihood estimation (MLE) is to find the $\hat \theta$ that maximizes $L(\theta)$.

### Log-Likelihood Function

Often $X_1,\dotsc,X_n$ are i.i.d. samples from $F(\theta)$, hence
$$L(\theta) = \prod_{i=1}^n f_\theta(X_i).$$
To maximize $L(\theta)$ is equivalent to maximizing its logarithm, given by
$$I(\theta) = \log L(\theta) = \sum_{i=1}^n \log f_\theta(X_i).$$
Usually, $I(\theta)$ reaches its maximum at a stationary point, where $\frac{\partial}{\partial \theta}I = 0$, i.e.
$$0=s(\theta) = \frac{\partial}{\partial \theta}I(\theta) .$$
We call $s(\theta)$ the score function.
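As an illustration of solving the score equation, the sketch below assumes an exponential model with rate $\lambda$, i.e. $f_\lambda(x) = \lambda e^{-\lambda x}$. This model is not discussed in these notes and is chosen only because its score $s(\lambda) = n/\lambda - \sum_i X_i$ has the closed-form root $\hat\lambda = 1/\overline X$. The function names and the data are hypothetical.

```python
import numpy as np

def log_likelihood(lam, x):
    """Log-likelihood of i.i.d. Exponential(rate=lam) data."""
    x = np.asarray(x, dtype=float)
    return x.size * np.log(lam) - lam * x.sum()

def score(lam, x):
    """Derivative of the log-likelihood with respect to lam."""
    x = np.asarray(x, dtype=float)
    return x.size / lam - x.sum()

# Hypothetical observations, for illustration only.
x = [0.5, 1.5, 1.0, 2.0, 1.0]
lam_hat = 1.0 / np.mean(x)   # closed-form root of the score equation

print(lam_hat, score(lam_hat, x))
print(log_likelihood(lam_hat, x) > log_likelihood(0.9 * lam_hat, x))
```

The score vanishes at `lam_hat` up to rounding, and the log-likelihood is lower at nearby values, which is consistent with a maximum rather than a minimum.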


### Consistency

Under suitable regularity conditions the MLE is consistent, i.e. $\hat \theta_{\rm MLE} \stackrel{\mathbb P}{\rightarrow} \theta$ as $n\rightarrow \infty$.

### Regression

Suppose the parameter matrix $\Theta\in \mathbb R^{m\times n}$ is unknown. Given $X\in \mathbb R^n$, the response $Y\in \mathbb R^m$ is a random variable satisfying $Y = \Theta X+w$, where $w\sim N(0,\sigma^2\mathbb I_m)$ is noise in $\mathbb R^m$ and $\sigma^2$ is unknown. Having observed $(X_1,Y_1),\dotsc,(X_t,Y_t)$ with independent noise terms, the likelihood function is 
$$L(\Theta) = \prod_{j=1}^t \frac{1}{(2\pi\sigma^2)^\frac m2  }\exp\left(-\frac{1}{2\sigma^{2}}(Y_j - \Theta X_j)^T (Y_j - \Theta X_j)\right).$$

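Writing $\mathbf X = [X_1,\dotsc,X_t]\in\mathbb R^{n\times t}$ and $\mathbf Y = [Y_1,\dotsc,Y_t]\in\mathbb R^{m\times t}$, the log-likelihood function is
$$\begin{aligned}I(\Theta)& = \sum_{j=1}^t  \left(-\frac{1}{2\sigma^2}(Y_j - \Theta X_j)^T(Y_j - \Theta X_j)-\frac m2\log(2\pi) - m \log|\sigma| \right)\\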
&=-\frac{1}{2\sigma^2}\left({\rm tr}(\mathbf X^T\Theta^T\Theta \mathbf X)-2{\rm tr}(\mathbf Y^T\Theta\mathbf X)+{\rm tr}(\mathbf Y^T\mathbf Y)\right)
-\frac m2t\log(2\pi) - mt \log|\sigma| \\
&=-\frac{1}{2\sigma^2}\Vert\mathbf Y -\Theta\mathbf X\Vert_F^2 -\frac m2t\log(2\pi) - mt \log|\sigma|.
\end{aligned}$$

The MLE $\hat \Theta$ is therefore given by $\hat \Theta = {\rm argmin}_{\Theta}\Vert \mathbf Y -\Theta\mathbf X\Vert_F^2=\mathbf Y\mathbf X^\dag$, the solution to a least squares problem, where $\mathbf X^\dag$ denotes the Moore-Penrose pseudoinverse. Further, setting the derivative with respect to $\sigma$ to zero, we obtain
$$\hat \sigma^2=\frac{\Vert \mathbf Y -\mathbf Y\mathbf X^\dag \mathbf X\Vert_F^2}{mt} .$$
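The closed-form estimators can be tried on simulated data. The sketch below assumes the observations are stacked as columns, so that the data matrices have shapes $n\times t$ and $m\times t$; the dimensions, seed, and true parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, t, sigma = 2, 3, 5000, 0.5

Theta_true = rng.normal(size=(m, n))      # hypothetical ground truth
X = rng.normal(size=(n, t))               # columns are X_1, ..., X_t
Y = Theta_true @ X + sigma * rng.normal(size=(m, t))

# MLE of Theta: Y times the Moore-Penrose pseudoinverse of X.
Theta_hat = Y @ np.linalg.pinv(X)

# MLE of sigma^2: squared Frobenius norm of the residual divided by m * t.
resid = Y - Theta_hat @ X
sigma2_hat = np.linalg.norm(resid, "fro") ** 2 / (m * t)

print(np.abs(Theta_hat - Theta_true).max(), sigma2_hat)
```

With $t$ much larger than $n$, `Theta_hat` should be close to `Theta_true` and `sigma2_hat` close to $\sigma^2 = 0.25$. The same `Theta_hat` can be obtained from `np.linalg.lstsq(X.T, Y.T, rcond=None)`, which solves the transposed least squares problem.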


### Univariate Gaussian

Suppose $Y_1,\dotsc,Y_t\in \mathbb R$ are samples from $N(\mu,\sigma^2)$ where $\mu,\sigma^2$ are unknown. The MLE of $(\mu,\sigma^2)$ is a special case of the regression above. Setting $X_1 = X_2 = \dotsb  = X_t\equiv 1$ and $\Theta = \mu$, we have $\mathbf X^\dag = [\frac 1t,\dotsc,\frac 1t]^T$ and 
$$\hat \mu = \hat \Theta = \mathbf Y\mathbf X^\dag = \sum_{i=1}^t\frac 1t Y_i = \overline Y$$
and the MLE of the variance is given by
$$\hat \sigma^2 = \frac{1}{t}\sum_{i=1}^t (Y_i - \overline Y)^2.$$

This coincides with the MME in this case.
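The reduction to the regression formulas can be verified directly: for a row of ones, every entry of the pseudoinverse equals one over the number of samples. The following sketch uses a hypothetical four-point sample.

```python
import numpy as np

# Hypothetical sample, for illustration only.
y = np.array([1.0, 2.0, 4.0, 5.0])
t = y.size

X = np.ones((1, t))          # regressors X_j = 1, stacked as a 1 x t row
Y = y.reshape(1, t)          # responses stacked as a 1 x t row (m = 1)

X_pinv = np.linalg.pinv(X)   # shape (t, 1)
mu_hat = (Y @ X_pinv).item() # regression MLE of Theta = mu
sigma2_hat = np.linalg.norm(Y - mu_hat * X, "fro") ** 2 / t

print(X_pinv.ravel(), mu_hat, sigma2_hat)
```

Here `mu_hat` equals `y.mean()` and `sigma2_hat` equals `y.var()` (with NumPy's default `ddof=0`), matching the formulas above.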

## Fisher Information

Consider the score function $s(\theta) = \frac{\partial}{\partial \theta}\log f_{\theta}(x)$. For fixed $\theta$, $s(\theta)$ is a random variable through its dependence on $x$, and, assuming differentiation and integration can be interchanged,
$$\mathbb E(s(\theta))
=\int s(\theta)f_\theta (x)dx=\int \frac{1}{ f_\theta(x)}\frac{\partial f_\theta  (x)}{\partial \theta}f_\theta(x)dx
=\frac{\partial}{\partial\theta}\int f_{\theta }(x)dx=\frac{\partial}{\partial\theta}{\bf 1}=0
$$

Since its mean is zero, its covariance matrix is
$$\mathcal I(\theta) = {\rm Var}(s(\theta))=\mathbb E(s(\theta)s(\theta)^T)$$

Note that the $(i,j)$ entry can be represented as follows, where ${\bf 1} = \int f_\theta(x)dx$:
$$\begin{aligned}&\ \mathbb E\left(\frac{\partial \log f_\theta(x)}{\partial \theta_i}\frac{\partial \log f_\theta(x)}{\partial \theta_j}\right)
=\int \frac{1}{f_\theta (x)}\frac{\partial f_\theta(x)}{\partial \theta_i}
\frac{1}{f_\theta (x)}\frac{\partial f_\theta (x)}{\partial \theta_j}
f_\theta (x)dx = \int \frac{1}{f_\theta (x)}\frac{\partial f}{\partial \theta_i}\frac{\partial f}{\partial \theta_j}dx\\ &=  \int \frac{1}{f_\theta (x)}\frac{\partial f}{\partial \theta_i}\frac{\partial f}{\partial \theta_j}
dx-\frac{\partial ^2}{\partial \theta_i\partial \theta_j}{\bf 1}
=\int \left(\frac{1}{f_\theta (x)}\frac{\partial f}{\partial \theta_i}\frac{\partial f}{\partial \theta_j}
-\frac{\partial ^2f}{\partial \theta_i\partial \theta_j}\right)dx\\ &
=\mathbb E\left(\frac{1}{f^2_\theta (x)}\frac{\partial f}{\partial \theta_i}\frac{\partial f}{\partial \theta_j}
-\frac{1}{f_\theta (x)}\frac{\partial ^2f}{\partial \theta_i\partial \theta_j}\right)
=-\mathbb E\left(\frac{\partial }{\partial \theta_j}\left( \frac{1}{f_\theta (x)}\frac{\partial f}{\partial \theta_i}\right)
\right) \\ &= 
-\mathbb E\left(\frac{\partial ^2}{\partial \theta_i\partial \theta_j}\log f_\theta (x)
\right).
\end{aligned}
$$

Hence,
$$\mathcal I(\theta) = {\rm Var}(s(\theta))=\mathbb E(s(\theta)s(\theta)^T)
=-\mathbb E\left(\nabla_{\theta}^2\log L(\theta; x)\right).$$
This covariance matrix of $s(\theta)$ is called the Fisher information matrix.
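The two expressions for the Fisher information can be compared exactly for a simple discrete model. The sketch below assumes a Bernoulli($p$) observation, for which $\log f_p(x) = x\log p + (1-x)\log(1-p)$; this model and the function name `bernoulli_fisher` are chosen for illustration and do not appear elsewhere in these notes.

```python
import numpy as np

def bernoulli_fisher(p):
    """Return (E[s], Var(s), -E[d^2/dp^2 log f]) for one Bernoulli(p) draw."""
    xs = np.array([0.0, 1.0])          # possible outcomes
    probs = np.array([1.0 - p, p])     # their probabilities under p

    score = xs / p - (1.0 - xs) / (1.0 - p)              # d/dp log f_p(x)
    hess = -xs / p ** 2 - (1.0 - xs) / (1.0 - p) ** 2    # d^2/dp^2 log f_p(x)

    mean_score = np.sum(probs * score)
    var_score = np.sum(probs * score ** 2) - mean_score ** 2
    neg_exp_hess = -np.sum(probs * hess)
    return mean_score, var_score, neg_exp_hess

mean_score, var_score, neg_exp_hess = bernoulli_fisher(0.3)
print(mean_score, var_score, neg_exp_hess)
```

The score has mean zero, and both the variance of the score and the negative expected second derivative equal $\frac{1}{p(1-p)}$.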

In particular, when $X=[X_1,\dotsc,X_n]^T$ is an i.i.d. sample, we have 
$$\begin{aligned}\mathcal I_X(\theta)& =-\mathbb E_X\left(\nabla_{\theta}^2\log L(\theta; X)\right) 
=-\int \dotsi\int \sum_{i=1}^n \nabla_{\theta}^2\log f_\theta (x_i) \prod_{j=1}^n f_\theta(x_j)dx_1\dotsm dx_n\\
&=-n\int \dotsi\int \prod_{j=2}^n f_\theta(x_j)dx_2\dotsm dx_n\int\nabla_{\theta}^2\log f_\theta (x_1) f_\theta(x_1)dx_1 
\\&=-n\int\nabla_{\theta}^2\log f_\theta (x_1) f_\theta(x_1)dx_1 \\&=n\mathcal I_{X_1}(\theta).
\end{aligned}$$


### Cramér-Rao Inequality

Suppose $g(\theta)$ is differentiable and a statistic $T$ satisfies $\mathbb E(T) = g(\theta)$ for every parameter $\theta$, so that $T$ is an unbiased estimator of $g(\theta)$ everywhere. Then the variance of the statistic is bounded below by [[1](https://stats.stackexchange.com/questions/500529/proof-of-the-multivariate-cramer-rao-inequality)]
$${\rm Var}(T)\succeq  \left(\frac{\partial g}{\partial \theta}\right)^T\mathcal I(\theta)^{-1}\left(\frac{\partial g}{\partial \theta}\right).$$

When equality holds, $T$ is a **minimum variance unbiased estimator (MVUE)** of $g(\theta)$. 

In particular, if $g(\theta)\equiv \theta$ and $T = \hat\theta$ is unbiased, each component satisfies 
$${\rm Var}(\hat \theta_j) \geq \left(\mathcal I(\theta)^{-1}\right)_{jj}\geq \left(\mathcal I(\theta)_{jj}\right)^{-1}
=-\left(\mathbb E\left(\frac{\partial^2}{\partial \theta_j^2}\log f_\theta (x)\right)\right)^{-1}.$$


Example: Consider a 1D normal distribution $N(\mu,\sigma^2)$ where $\sigma^2$ is known but $\mu$ is unknown. Since $\log L(\mu; x) = -\frac{(x-\mu)^2}{2\sigma^2}-\log \sqrt{2\pi \sigma^2}$, the Fisher information for $\mu$ is given by
$$\mathcal I(\mu) = -\mathbb E\left(\frac{\partial^2}{\partial \mu^2}\log L(\mu; x)\right)
=\sigma^{-2}.
$$

Now assume we have an i.i.d. sample $X = [X_1,\dotsc,X_n]^T$, so that $\mathcal I_X(\mu) = n\sigma^{-2}$. Let $T = \overline X$ be the plug-in estimator, for which $\mathbb E(T)=\mu$. Since $T\sim N(\mu,\frac1n \sigma^2)$, its variance is 
$${\rm Var}(T) = \frac 1n \sigma^2=
\frac{\left(\frac{\partial}{\partial \mu}\mathbb E\left(T\right)\right)^2}{\mathcal I_X(\mu)}.$$

Hence the Cramér-Rao bound is attained, and we conclude that $T = \overline X$ is the MVUE.
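The example can be checked by simulation. The sketch below compares the empirical variance of the sample mean with $\sigma^2/n$, and also reports the variance of the sample median, another unbiased estimator of $\mu$ for normal data. The parameter values, seed, and the choice of the median as a comparison are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 1.5, 2.0, 25, 100_000

# Each row is one sample of size n from N(mu, sigma^2).
samples = rng.normal(mu, sigma, size=(reps, n))

mean_var = samples.mean(axis=1).var()          # empirical Var of the sample mean
median_var = np.median(samples, axis=1).var()  # empirical Var of the sample median
cr_bound = sigma ** 2 / n                      # 1 / I_X(mu) = sigma^2 / n

print(mean_var, median_var, cr_bound)
```

Up to Monte Carlo error, the variance of the sample mean should match the bound, while the variance of the median should be visibly larger.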


### Asymptotic Normality

As $n\rightarrow \infty$, the MLE based on i.i.d. samples satisfies, under regularity conditions, 
$$\sqrt n(\hat \theta_{\rm MLE} - \theta)\stackrel{d}{\rightarrow}N(0,\mathcal I_{X_1}(\theta)^{-1}).$$

This implies that the MLE attains the Cramér-Rao lower bound asymptotically as $n\rightarrow \infty$. Any (consistent) estimator with this property is called asymptotically efficient.
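The limiting variance can be observed in simulation. The sketch below assumes an exponential model with rate $\lambda$, for which the MLE is $\hat\lambda = 1/\overline X$ and the per-sample Fisher information is $\lambda^{-2}$, so the limit law is $N(0,\lambda^2)$. The model, sample sizes, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 2.0, 400, 5000

# Each row is one i.i.d. sample of size n from Exponential(rate=lam).
samples = rng.exponential(scale=1.0 / lam, size=(reps, n))

lam_hat = 1.0 / samples.mean(axis=1)   # MLE of the rate in each repetition
z = np.sqrt(n) * (lam_hat - lam)       # rescaled estimation error

z_mean = z.mean()
z_var = z.var()
print(z_mean, z_var, lam ** 2)         # z_var should be near lam^2
```

For finite $n$ the MLE of the rate is slightly biased upward, so `z_mean` is small but not exactly zero, and `z_var` is close to, rather than equal to, $\lambda^2$.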