# 4 Statistics

Suppose there is a distribution with parameters $\theta$. For example, a normal distribution has parameters 
$\theta = (\mu,\sigma^2)$, the mean and the variance, while a Gamma distribution has parameters such as $\theta = (\alpha, \beta)$. 

Now suppose that these parameters are **unknown**. The only information we have is a random sample from the distribution, $X_1,\dotsc,X_n$. Our goal is to guess (or estimate) the unknown parameters $\theta$ from the sample.


### Sample 

A sample, or random sample, consists of $n$ **independent** observations drawn from the same distribution, which we often denote 
by $$X = \left(X_1,X_2,\dotsc,X_n\right).$$

### Statistic

A function of the sample is called a statistic. For example, 
$$\hat \mu(X) = \frac 1n\left(X_1+X_2+\dotsc +X_n\right)$$
is a statistic because one can compute $\hat \mu(X)$ from $X$ alone, without knowing $\theta$.
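
To make the definition concrete, here is a minimal R sketch. The data vector `x` is made up purely for illustration; the point is only that both quantities are computed from the sample and nothing else.

```r
# Hypothetical sample of size n = 5 (values chosen only for illustration).
x <- c(2.1, 3.4, 1.9, 4.0, 2.6)

# The sample mean is a statistic: it depends on the data only.
mu_hat <- sum(x) / length(x)

# The sample maximum is another statistic.
x_max <- max(x)

print(mu_hat)
print(x_max)
```

By contrast, a quantity such as $(X_1 - \mu)/\sigma$ is not a statistic when $\mu$ and $\sigma$ are unknown, because it cannot be evaluated from the data alone.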


## Point Estimation

Point estimation, a fundamental task in statistics, asks for a single 'best' guess of the unknown parameters $\theta$, given by a statistic,
$$\hat \theta = g(X_1,X_2,\dotsc,X_n).$$

Because the estimator $\hat \theta$ is computed from the random sample $X_1,X_2,\dotsc,X_n$, it is itself a random variable. Despite this randomness, we aim to construct an estimator $\hat \theta$ that is close to $\theta$ with high probability.



### Mean Square Error

To evaluate the performance of the estimator $\hat \theta$, we introduce the mean square error
$$\mathcal L_{MSE} = \mathbb E\left((\hat \theta -\theta)^2\right)$$

This is the average (expected) squared distance from $\hat \theta $ to $\theta$. 
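
Since the expectation is taken over repeated samples, the mean square error can be approximated by simulation. The sketch below does this for the sample mean of a normal sample; the values of `mu`, `sigma`, `n` and `reps` are assumptions chosen for illustration, and the comparison value $\sigma^2/n$ is the known MSE of the sample mean.

```r
set.seed(1)                      # fixed seed so the run is reproducible
mu <- 3; sigma <- 2; n <- 25     # assumed true parameters and sample size
reps <- 20000                    # number of simulated samples

# Each replication draws a fresh sample and records the estimate.
mu_hat <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

mse_mc <- mean((mu_hat - mu)^2)  # Monte Carlo approximation of the MSE
mse_theory <- sigma^2 / n        # exact MSE of the sample mean: 0.16

print(c(monte_carlo = mse_mc, theory = mse_theory))
```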

### Bias

The mean square error splits into two parts. Applying $\mathbb E(Y^2) = (\mathbb E Y)^2 + {\rm Var}(Y)$ to $Y = \hat \theta - \theta$, and noting that subtracting the constant $\theta$ does not change the variance, gives
$$\begin{aligned}
\mathcal L_{MSE} &= \mathbb E\left((\hat \theta -\theta)^2\right)\\
&= \left( \mathbb E(\hat \theta -\theta)\right)^2
+{\rm Var}(\hat \theta-\theta)\\
&=\left( {\rm Bias}(\hat \theta)\right)^2
+{\rm Var}(\hat \theta),
\end{aligned}$$
where $ {\rm Bias}(\hat \theta)= \mathbb E(\hat \theta -\theta)= \mathbb E(\hat \theta )-\theta$ is called the bias of the estimator. 

In particular, when an estimator has zero bias for every value of $\theta$, we call the estimator unbiased.
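
As an illustration of a biased estimator, the following sketch simulates the variance estimator that divides by $n$ rather than $n-1$, and checks numerically that the mean square error equals the squared bias plus the variance. The helper `var_n` and all parameter values are assumptions introduced for this example; for normal data the bias of this estimator is $-\sigma^2/n$.

```r
set.seed(2)
sigma2 <- 4; n <- 10; reps <- 50000   # assumed true variance, sample size, replications

# Estimator under study: the variance estimator that divides by n, not n - 1.
var_n <- function(x) mean((x - mean(x))^2)

est <- replicate(reps, var_n(rnorm(n, mean = 0, sd = sqrt(sigma2))))

bias_mc <- mean(est) - sigma2          # theory: -sigma2 / n = -0.4
var_mc  <- mean((est - mean(est))^2)   # variance of the simulated estimates
mse_mc  <- mean((est - sigma2)^2)      # Monte Carlo mean square error

# The two printed numbers agree up to floating-point rounding.
print(c(mse = mse_mc, bias2_plus_var = bias_mc^2 + var_mc))
```

The nonzero bias shows that this estimator is not unbiased, although the bias vanishes as $n$ grows.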

### Standard Deviation

The standard deviation of the estimator, usually called its standard error, is 
$${\rm SE}(\hat \theta) = \sqrt{{\rm Var}(\hat \theta)}.$$

Therefore we can rewrite the mean square error in terms of the bias and the standard error,
$$\mathcal L_{MSE} = \left( {\rm Bias}(\hat \theta)\right)^2+\left({\rm SE}(\hat \theta)\right)^2$$

### Estimated Standard Error

The estimated standard error is an estimate of ${\rm SE}(\hat \theta)$, obtained by plugging the estimate $\hat \theta$ in for the unknown $\theta$,
$$\widehat{\rm SE}(\hat \theta) = \sqrt{{\rm Var_{\hat \theta}}(\hat \theta)}.$$

### Consistency

The estimator $\hat \theta_n$ is consistent if $\hat \theta_n\stackrel{\mathbb P}{\rightarrow}\theta$, that is, if it converges in probability to $\theta$ as 
the sample size $n$ grows large. 

If ${\rm MSE}(\hat \theta_n)\rightarrow 0$, then 
$\hat \theta_n \stackrel{m.s.}{\rightarrow}\theta$ (convergence in mean square) and therefore $\hat \theta_n\stackrel{\mathbb P}{\rightarrow}\theta$, so $\hat \theta_n$ is consistent.

Proof: By the Markov inequality applied to $(\hat \theta_n - \theta)^2$, for any $\epsilon>0$ we have
$$\mathbb P(|\hat \theta_n - \theta|\geqslant \epsilon) = \mathbb P((\hat \theta_n - \theta)^2\geqslant \epsilon^2) \leqslant 
\frac{\mathbb E\left((\hat \theta_n - \theta)^2\right)}{\epsilon^2}\rightarrow 0.$$
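
The argument can be watched at work in a small simulation. The sketch below, under assumed values of `eps`, `reps` and the sample sizes `ns`, estimates $\mathbb P(|\overline X_n - \mu|\geqslant \epsilon)$ for standard normal data and compares it with the Markov-type bound ${\rm MSE}/\epsilon^2 = 1/(n\epsilon^2)$.

```r
set.seed(3)
mu <- 0; eps <- 0.2; reps <- 5000   # assumed mean, tolerance and replications
ns <- c(10, 100, 1000)              # sample sizes to compare

# For each n, estimate P(|mean - mu| >= eps) by simulation.
p_far <- sapply(ns, function(n) {
  est <- replicate(reps, mean(rnorm(n, mean = mu, sd = 1)))
  mean(abs(est - mu) >= eps)
})

# Bound MSE / eps^2 = 1 / (n * eps^2), capped at 1 since it bounds a probability.
bound <- pmin(1, 1 / (ns * eps^2))

print(data.frame(n = ns, p_far = p_far, bound = bound))
```

The simulated probabilities fall well below the bound, which is typical: the Markov inequality is crude, but it is enough to prove convergence.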

### Asymptotic Normality

An estimator $\hat \theta$ is asymptotically normal if
$$\frac{\hat \theta - \theta}{{\rm SE}(\hat \theta)}\stackrel{d}{\rightarrow}N(0,1).$$

#### Example 

Consider a Bernoulli distribution $B(p)$ where $p$ is an unknown parameter. Given a sample $X_1,X_2,\dotsc,X_n$, we estimate $p$ by $\hat p = \frac{1}{n}\left(X_1+X_2+\dotsc +X_n\right)$. Its bias is 
$${\rm Bias}(\hat p) = \mathbb E(\hat p ) - p = 0.$$
Since $n\hat p\sim B(n,p)$, we have ${\rm Var}(\hat p) = \frac {1}{n^2}np(1-p) = \frac{p(1-p)}{n}$, so the standard error is
$${\rm SE}(\hat p) = \sqrt{\frac{p(1-p)}{n}},$$
and the estimated standard error is obtained by substituting $\hat p$ for $p$, that is, 
$$\widehat {\rm SE}(\hat p) = \sqrt{\frac{\hat p(1 - \hat p )}{n}}.$$
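
A short R sketch of this example follows. The true value `p_true` and the sample size `n` are assumptions for the simulation; the standard error is taken as the square root of the variance $p(1-p)/n$, and its estimated version replaces $p$ by $\hat p$.

```r
set.seed(4)
p_true <- 0.3; n <- 200                      # assumed values for the simulation
x <- rbinom(n, size = 1, prob = p_true)      # Bernoulli(p_true) sample

p_hat  <- mean(x)                            # point estimate of p
se_hat <- sqrt(p_hat * (1 - p_hat) / n)      # estimated (plug-in) standard error
se     <- sqrt(p_true * (1 - p_true) / n)    # true standard error, about 0.0324

print(c(p_hat = p_hat, se_hat = se_hat, se = se))
```

In practice only `p_hat` and `se_hat` are available, since `se` requires the unknown $p$.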


## Confidence Sets

Consider a distribution with unknown parameters $\theta$ and a random sample $X_1,X_2,\dotsc,X_n$ from it. Suppose $L,U$ are statistics such that, for every **fixed** $\theta$,
$$\mathbb P(L(X_1,X_2,\dotsc,X_n)\leqslant \theta \leqslant U(X_1,X_2,\dotsc,X_n)) \geqslant \alpha,$$ 
where the probability is taken with respect to $X$. Then we call $(L,U)$ a (at least) $(100\alpha) \%$ confidence set for $\theta$.

<font color = red>The randomness lies in the sample $X$; the parameter $\theta$ is fixed. </font>


#### Example

Suppose $X_1,X_2,\dotsc,X_n$ are $n$ samples from a normal distribution $N(\mu,1)$ where $\mu$ is an unknown constant. Since $\sqrt{n}(\overline X - \mu)\sim N(0,1)$, we have 
$$\mathbb P(-1.96\leqslant \sqrt{n}(\overline X - \mu) \leqslant 1.96) \approx 0.95,$$
which can be equivalently written as 
$$\mathbb P(\overline X -\frac{1.96}{\sqrt n}\leqslant   \mu \leqslant \overline X +\frac{1.96}{\sqrt n}) \approx 0.95.$$

This means: if we bracket $\mu$ between the lower bound $\overline X -\frac{1.96}{\sqrt n}$  and the upper bound
$\overline X +\frac{1.96}{\sqrt n}$, then the interval contains $\mu$ with probability $95\%$ (regardless of what $\mu$ is!). Hence 
$\left(\overline X -\frac{1.96}{\sqrt n},\overline X +\frac{1.96}{\sqrt n}\right)$ is a $95\%$ confidence set (or confidence interval) for $\mu$.
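
The $95\%$ statement is about repeated sampling, which a simulation can make visible. In the sketch below, the values of `mu`, `n` and `reps` are assumptions; we draw many samples from $N(\mu,1)$ and record how often the interval $\overline X \pm 1.96/\sqrt n$ contains the fixed $\mu$.

```r
set.seed(5)
mu <- 1.5; n <- 30; reps <- 10000   # assumed true mean, sample size, replications
half <- 1.96 / sqrt(n)              # half-width of the interval (known variance 1)

# For each simulated sample, check whether the interval covers mu.
covered <- replicate(reps, {
  m <- mean(rnorm(n, mean = mu, sd = 1))
  (m - half <= mu) && (mu <= m + half)
})

coverage <- mean(covered)           # fraction of intervals containing mu
print(coverage)
```

The interval changes from sample to sample while $\mu$ stays fixed; the coverage is the long-run fraction of intervals that capture it.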

### Asymptotic Normality

Suppose $\hat \theta_n$ is asymptotically normal, i.e. $\frac{\hat \theta_n - \theta}{{\rm  SE}(\hat \theta_n)}\stackrel{d}{\rightarrow} N(0,1)$. Then for sufficiently large $n$ we can treat the standardized estimator as standard normal. Writing $\hat \theta = \hat \theta_n$ and letting $\Phi$ denote the CDF of $N(0,1)$, for any $0< \alpha < 1$ we have, approximately, 
$$\mathbb P\left(-\Phi^{-1}\left(\frac {1+\alpha}{2}\right)\leqslant \frac{\hat \theta - \theta}{{\rm  SE}(\hat \theta)}\leqslant \Phi^{-1}\left(\frac {1+\alpha}{2}\right)\right) \approx \alpha,$$
or, equivalently, 
$$\mathbb P\left(\hat \theta-{\rm  SE}(\hat \theta)\Phi^{-1}\left(\frac {1+\alpha}{2}\right)\leqslant  \theta \leqslant \hat \theta+{\rm  SE}(\hat \theta) \Phi^{-1}\left(\frac {1+\alpha}{2}\right)\right) \approx \alpha.$$

In particular, taking $\Phi^{-1}\left(\frac{1+\alpha}{2}\right) = 2$, i.e. $\alpha = 2\Phi(2) - 1\approx 0.954$ (the exact $95\%$ level corresponds to $1.96$, since $2\Phi(1.96) - 1\approx 0.95$), leads to a practical $95\%$ confidence interval, 
$$\mathbb P\left(\hat \theta-2{\rm  SE}(\hat \theta)\leqslant  \theta \leqslant \hat \theta+2{\rm  SE}(\hat \theta) \right) \approx 0.95.$$

```r
# Coverage of the +/- 2 SE interval under the normal approximation
2 * pnorm(2) - 1  # about 0.9545
```
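
The same recipe works for any confidence level by replacing the multiplier $2$ with the normal quantile $\Phi^{-1}\left(\frac{1+\alpha}{2}\right)$, available in R as `qnorm`. The helper `wald_interval` below is a hypothetical name introduced for this sketch, and the numbers (60 successes in 200 Bernoulli trials) are made up. Note also the assumption that the estimated standard error is used in place of the unknown ${\rm SE}(\hat \theta)$.

```r
# Hypothetical helper: normal-approximation interval theta_hat +/- z * SE.
wald_interval <- function(theta_hat, se, level = 0.95) {
  z <- qnorm((1 + level) / 2)      # quantile Phi^{-1}((1 + level) / 2)
  c(lower = theta_hat - z * se, upper = theta_hat + z * se)
}

z95 <- qnorm((1 + 0.95) / 2)       # about 1.96

# Illustration with made-up numbers: 60 successes in 200 Bernoulli trials.
p_hat  <- 60 / 200
se_hat <- sqrt(p_hat * (1 - p_hat) / 200)   # estimated SE stands in for the true SE
ci <- wald_interval(p_hat, se_hat)

print(ci)
```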

## Hypothesis Testing 

There are three fundamental problems in statistics: point estimation, confidence sets and hypothesis testing.

Suppose we have a hypothesis $H_0$, called the null hypothesis. We ask whether the data provide enough evidence to 
reject it. This is hypothesis testing. The topic is treated in later courses.


## Empirical Distribution Function 

Let $X_1,\dotsc,X_n$ be a sample from a distribution with CDF $F$. A nonparametric estimator of $F$, called the empirical distribution function, is given by 
$$\hat F(x) = \frac1n \sum_{i=1}^n \mathbb I(X_i\leqslant x) = \frac{{\rm Number \ of\ } X_i\leqslant x}{n}.$$

For fixed $x$, the indicators $\mathbb I(X_i\leqslant x)$ are independent Bernoulli variables with success probability $F(x)$, so $n\hat F(x)\sim B(n,F(x))$ and
$$\mathbb E(\hat F(x)) = F(x)\quad \quad {\rm Var}(\hat F(x)) = \frac1nF(x)(1 - F(x)).$$
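
Base R provides the empirical distribution function through `ecdf`. The sketch below, with an assumed standard normal sample so that the true $F$ is `pnorm`, compares `ecdf` with the definition at a single point `x0` and computes the standard deviation of $\hat F(x_0)$ from the variance formula.

```r
set.seed(6)
n <- 500
x <- rnorm(n)                     # assumed sample from N(0, 1), so F = pnorm

F_hat <- ecdf(x)                  # base R: returns the step function F_hat

x0 <- 0.5                         # evaluation point chosen for illustration
manual <- mean(x <= x0)           # the definition, computed directly
se_x0 <- sqrt(pnorm(x0) * (1 - pnorm(x0)) / n)   # sd of F_hat(x0) from the variance formula

print(c(ecdf = F_hat(x0), manual = manual, truth = pnorm(x0), se = se_x0))
```

Since the true $F$ is unknown in practice, `se_x0` would be estimated by replacing `pnorm(x0)` with `F_hat(x0)`.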