# 5 Point Estimation
Point estimation, a fundamental task in statistics, asks for a single 'best' guess of an unknown parameter $\theta$, given by a statistic
$$\hat \theta = g(X_1,X_2,\dotsc,X_n).$$

Because the estimator $\hat \theta$ depends on the samples $X_1,X_2,\dotsc,X_n$, it is itself a random variable. Despite this randomness, we aim to construct an estimator $\hat \theta$ that is close to $\theta$ with high probability.

## Method of Moments Estimation

Let $\mu_k \equiv \mu_k(\theta) =   \mathbb E(X^k)$ denote the $k$-th moment. Denote the sample moment by 
$$M_k = \frac 1n \sum_{i=1}^n X_i^k.$$

The MM estimator (MME) $\hat \theta$ is chosen so that the first $k$ theoretical moments, evaluated at $\hat \theta$, equal the corresponding sample moments:

$$\mu_i (\hat \theta) = M_i\quad (i = 1,2,\dotsc,k).$$

For example, take $k = 2$ and estimate the mean and the variance $(\mu,\sigma^2)$. Since $\mu_1 = \mu$ and $\mu_2 = \sigma^2 + \mu^2$, the moment equations read
$\hat\mu = M_1$ and $\hat\sigma^2 + \hat \mu^2 = M_2$, so the MM estimator is given by 
$$\hat \mu = \overline X\quad\quad\quad \hat \sigma^2 = \frac 1n \sum_{i=1}^n X_i^2 - (\overline X)^2
 = \frac1n\sum_{i=1}^n\left(X_i - \overline X\right)^2.$$
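
The two moment equations can be checked numerically. The following is a minimal Python sketch using NumPy; the function name `mme_mean_variance` and the sample values are illustrative assumptions, not part of the notes.

```python
import numpy as np

def mme_mean_variance(x):
    """Method-of-moments estimates of (mu, sigma^2) from a 1-D sample."""
    x = np.asarray(x, dtype=float)
    m1 = np.mean(x)        # first sample moment M_1
    m2 = np.mean(x ** 2)   # second sample moment M_2
    mu_hat = m1                  # solves mu = M_1
    sigma2_hat = m2 - m1 ** 2    # solves sigma^2 + mu^2 = M_2
    return mu_hat, sigma2_hat

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
mu_hat, sigma2_hat = mme_mean_variance(sample)
```

For this sample $M_1 = 5$ and $M_2 = 29$, so the estimates are $\hat\mu = 5$ and $\hat\sigma^2 = 29 - 25 = 4$, which agrees with the centered form $\frac1n\sum_i (X_i - \overline X)^2$.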
 
### Sample Variance

Note that the order-2 MM estimator $\hat\sigma^2$ is biased. Using $\mathbb E(Z^2) = {\rm Var}(Z) + \left(\mathbb E(Z)\right)^2$ for $Z = X$ and $Z = \overline X$,
$$\begin{aligned}\mathbb E(\hat \sigma^2)& = \mathbb E\left( \frac 1n \sum_{i=1}^n X_i^2- (\overline X)^2\right) = 
\left({\rm Var}( X)+\left(\mathbb E( X)\right)^2\right)
 -\left({\rm Var}(\overline X)+\left(\mathbb E(\overline X)\right)^2\right)\\ 
 &=  (\sigma^2+\mu^2) - \left(\frac{1}{n}\sigma^2+\mu^2\right)
\\&=\frac{n-1}{n}\sigma^2.
 \end{aligned}$$
An unbiased estimator of $\sigma^2$, called the sample variance, is obtained by rescaling $\hat\sigma^2$ by $\frac{n}{n-1}$:
$$\hat\sigma'^2 =\frac{1}{n-1}\sum_{i=1}^n \left(X_i - \overline X\right)^2.$$
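
The factor $\frac{n-1}{n}$ can be observed in a small simulation. The sketch below is only an illustration: the normal distribution, the seed, and the values of `n`, `trials` and `sigma2` are arbitrary assumptions. It averages both variance estimators over many independent samples.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
n, trials, sigma2 = 5, 200_000, 4.0

# Each row is one sample of size n from N(0, sigma2).
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))

mm_var = samples.var(axis=1, ddof=0)       # divides by n (MM estimator)
sample_var = samples.var(axis=1, ddof=1)   # divides by n - 1 (sample variance)

mean_mm = mm_var.mean()              # expected near (n - 1) / n * sigma2 = 3.2
mean_unbiased = sample_var.mean()    # expected near sigma2 = 4.0
```

In NumPy the `ddof` argument of `var` selects the divisor $n - \text{ddof}$, so `ddof=0` corresponds to the MM estimator and `ddof=1` to the sample variance.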

### Asymptotic Normality

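Under mild regularity conditions, the MME $\hat \theta$ converges to $\theta$ in probability and is asymptotically normal:
$$\sqrt {n} (\hat \theta - \theta)\xrightarrow{d} N(0,V(\theta))\quad\quad (n\rightarrow \infty),$$
where the asymptotic (co)variance $V(\theta)$ depends on the model and is in general not equal to $1$.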
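
As an illustration, consider an exponential model with rate $\lambda$ (an example chosen here, not taken from the notes). Matching the first moment $\mathbb E(X) = 1/\lambda$ gives $\hat\lambda = 1/\overline X$, and the delta method suggests that $\sqrt n(\hat\lambda - \lambda)$ has limiting standard deviation $\lambda$. The following sketch, with arbitrary seed and sizes, checks this by simulation.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed, chosen arbitrarily
lam, n, trials = 2.0, 400, 5000

# X ~ Exponential(rate=lam) has E(X) = 1 / lam,
# so the first-moment equation gives lam_hat = 1 / mean(X).
x = rng.exponential(scale=1.0 / lam, size=(trials, n))
lam_hat = 1.0 / x.mean(axis=1)

z = np.sqrt(n) * (lam_hat - lam)   # rescaled estimation error
z_mean, z_std = z.mean(), z.std()  # expected near 0 and near lam
```

A histogram of `z` should look approximately Gaussian, centered near zero with spread close to `lam` rather than 1.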

## Maximum Likelihood Estimation

### Likelihood Function

For a fixed observation $x$, the likelihood $L(\theta) = f_\theta(x)$ is the density (or probability mass) of $x$ 
under the distribution $F(\theta)$, viewed as a function of the parameter $\theta$. 

The strategy of maximum likelihood estimation (MLE) is to choose the $\hat \theta$ that maximizes $L(\theta)$.

### Log-Likelihood Function

Often $X_1,\dotsc,X_n$ are i.i.d. samples from $F(\theta)$, hence
$$L(\theta) = \prod_{i=1}^n f_\theta(X_i).$$
To maximize $L(\theta)$ is equivalent to maximizing its logarithm, given by
$$I(\theta) = \log L(\theta) = \sum_{i=1}^n \log f_\theta(X_i).$$
Usually, $I(\theta)$ reaches its maximum at a point where $\frac{\partial}{\partial \theta}I = 0$, that is,
$$0=s(\theta) = \frac{\partial}{\partial \theta}I(\theta) .$$
We call $s(\theta)$ the score function.
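
As a concrete case, take a Poisson model with mean $\lambda$ (an illustrative choice, not from the notes). Then $I(\lambda) = \sum_i X_i \log\lambda - n\lambda + {\rm const}$ and $s(\lambda) = \frac{1}{\lambda}\sum_i X_i - n$, whose root is $\hat\lambda = \overline X$. The sketch below, using made-up counts and hypothetical helper names, compares this root with a brute-force grid maximization of the log-likelihood.

```python
import numpy as np

counts = np.array([3, 1, 4, 1, 5, 9, 2, 6])  # made-up Poisson counts

def log_likelihood(lam, x):
    # Poisson log-likelihood up to the additive constant -sum(log(x_i!)),
    # which does not depend on lam and so does not affect the maximizer.
    return np.sum(x) * np.log(lam) - len(x) * lam

def score(lam, x):
    # Derivative of the log-likelihood with respect to lam.
    return np.sum(x) / lam - len(x)

lam_hat = counts.mean()   # root of the score equation s(lam) = 0

# Brute-force check: maximize the log-likelihood over a fine grid.
grid = np.linspace(0.5, 10.0, 2000)
lam_grid = grid[np.argmax(log_likelihood(grid, counts))]
```

The score is positive to the left of `lam_hat` and negative to the right, which confirms that the stationary point is a maximum rather than a minimum.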

### Regression

Suppose the parameter matrix $\Theta\in \mathbb R^{m\times n}$ is unknown. Given $X\in \mathbb R^n$, the random variable $Y\in \mathbb R^m$ satisfies $Y = \Theta X+w$, where $w\sim N(0,\sigma^2\mathbb I_m)$ is noise in $\mathbb R^m$ and $\sigma^2$ is also unknown. Having observed $(X_1,Y_1),\dotsc,(X_t,Y_t)$ with independent noise terms, the likelihood function is 
$$L(\Theta) = \prod_{j=1}^t \frac{1}{(2\pi\sigma^2)^\frac m2  }\exp\left(-\frac{1}{2\sigma^{2}}(Y_j - \Theta X_j)^T (Y_j - \Theta X_j)\right).$$

Write $\mathbf X = [X_1,\dotsc,X_t]\in\mathbb R^{n\times t}$ and $\mathbf Y = [Y_1,\dotsc,Y_t]\in\mathbb R^{m\times t}$ for the matrices whose columns are the observations. The log-likelihood function is
$$\begin{aligned}I(\Theta)& = \sum_{j=1}^t  \left(-\frac{1}{2\sigma^2}(Y_j - \Theta X_j)^T(Y_j - \Theta X_j)-\frac m2\log(2\pi) - m \log|\sigma| \right)\\
&=-\frac{1}{2\sigma^2}\left({\rm tr}(\mathbf X^T\Theta^T\Theta \mathbf X)-2{\rm tr}(\mathbf Y^T\Theta\mathbf X)+{\rm tr}(\mathbf Y^T\mathbf Y)\right)
-\frac{mt}{2}\log(2\pi) - mt \log|\sigma| \\
&=-\frac{1}{2\sigma^2}\Vert\mathbf Y -\Theta\mathbf X\Vert_F^2 -\frac{mt}{2}\log(2\pi) - mt \log|\sigma|.
\end{aligned}$$

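The MLE $\hat \Theta$ is therefore given by $\hat \Theta = {\rm argmin}_\Theta\Vert \mathbf Y -\Theta\mathbf X\Vert_F^2=\mathbf Y\mathbf X^\dag$, the solution of a least squares problem, where $\mathbf X^\dag$ denotes the Moore-Penrose pseudoinverse (the minimizer is unique when $\mathbf X$ has full row rank). Further, setting the derivative of the log-likelihood with respect to $\sigma$ to zero, we obtain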
$$\hat \sigma^2=\frac{\Vert \mathbf Y -\mathbf Y\mathbf X^\dag \mathbf X\Vert_F^2}{mt} .$$
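
The two closed-form estimators can be tried on synthetic data. The following is a minimal NumPy sketch; the dimensions, noise level, seed and variable names are illustrative assumptions. It stores the observations as columns, so the data matrices have shapes $n\times t$ and $m\times t$.

```python
import numpy as np

rng = np.random.default_rng(2)    # fixed seed, chosen arbitrarily
m, n, t, sigma = 2, 3, 500, 0.1   # illustrative sizes and noise level

Theta_true = rng.normal(size=(m, n))
X = rng.normal(size=(n, t))                       # columns are X_1, ..., X_t
Y = Theta_true @ X + sigma * rng.normal(size=(m, t))

# MLE of Theta: least squares solution via the Moore-Penrose pseudoinverse.
Theta_hat = Y @ np.linalg.pinv(X)

# MLE of sigma^2: squared Frobenius norm of the residual divided by m * t.
residual = Y - Theta_hat @ X
sigma2_hat = np.linalg.norm(residual, "fro") ** 2 / (m * t)
```

In practice one might prefer `np.linalg.lstsq(X.T, Y.T, rcond=None)` over forming the pseudoinverse explicitly; it solves the transposed problem and returns $\hat\Theta^T$.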


### Gaussian

Suppose $Y_1,\dotsc,Y_t\in \mathbb R$ are samples from $N(\mu,\sigma^2)$ and $\mu,\sigma^2$ are unknown. The MLE of $(\mu,\sigma^2)$ is a special case of the regression above. Taking $m = n = 1$, $X_1 = X_2 = \dotsc  = X_t\equiv 1$ and $\Theta = \mu$, we have $\mathbf X = [1,\dotsc,1]$ and $\mathbf X^\dag = [\frac 1t,\dotsc,\frac 1t]^T$, so
$$\hat \mu = \hat \Theta = \mathbf Y\mathbf X^\dag = \sum_{i=1}^t\frac 1t Y_i = \overline Y$$
and the MLE of the variance is given by
$$\hat \sigma^2 = \frac{1}{t}\sum_{i=1}^t (Y_i - \overline Y)^2.$$

In this case the MLE coincides with the MME.
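
The reduction to the regression formulas can be verified directly. The sketch below uses made-up observations and builds the all-ones design row, then compares the pseudoinverse formulas with the sample mean and the divide-by-$t$ variance.

```python
import numpy as np

y = np.array([1.2, 0.7, 2.5, 1.9, 1.1])   # made-up observations
t = y.size

X = np.ones((1, t))    # X_1 = ... = X_t = 1, stored as a 1 x t row
Y = y.reshape(1, t)    # observations as a 1 x t row

X_pinv = np.linalg.pinv(X)                       # t x 1 column with entries 1 / t
mu_hat = (Y @ X_pinv).item()                     # equals the sample mean
sigma2_hat = np.sum((Y - mu_hat * X) ** 2) / t   # m = 1, so divide by t
```

Here `mu_hat` matches `y.mean()` and `sigma2_hat` matches `y.var()` (NumPy's default `ddof=0`), the same values the method of moments produces.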