# Statistical foundations of Machine Learning INFO-F-422


## TP 2 - Estimation

#### *Yann-Aël Le Borgne, Fabrizio Carcillo and Gianluca Bontempi*

####  March 14, 2017



## Basic notions

* Estimation: the procedure used to *estimate* a parameter of a distribution (expected value, variance, ...) from $N$ samples drawn from this distribution.
* Typical estimators of the expected value and the variance are given by the sample mean
$$
 \hat{\mu}=\frac{1}{N}\sum_{i=1}^N z_i
$$
and sample variance
$$
\hat{\sigma}^2= \frac{1}{N-1}\sum_{i=1}^N (z_i-\hat{\mu})^2,
$$
where $D_N=\{z_1,\ldots,z_N\}$ is our sample set.
* An estimator $\hat{\boldsymbol{\theta}}$ is a random variable itself, since it depends on a random sample $\mathbf{D}_N$.
* An estimator $\hat{\boldsymbol{\theta}}$ of a parameter $\theta$ is called unbiased if and only if

\begin{equation}
 {E}_{\boldsymbol{D}_N}[\hat{\boldsymbol{\theta}}]=\theta.
\end{equation}

More generally, the *bias* of an estimator is defined as

\begin{equation}
\mbox{Bias}[\hat{\boldsymbol{\theta}}]={E}_{\boldsymbol{D}_N}[\hat{\boldsymbol{\theta}}]-\theta.
\end{equation}
*  The variance of an estimator is defined as
\begin{equation}
 \mbox{Var}[\hat{\boldsymbol{\theta}}]={E}_{\boldsymbol{D}_N}[(\hat{\boldsymbol{\theta}}-E[\hat{\boldsymbol{\theta}}])^2].
\end{equation}
*  Bias and variance of $\hat{\boldsymbol{\mu}}$:
\begin{equation}
 {E}_{\boldsymbol{D}_N}[\hat{\boldsymbol{\mu}}]=\mu.
\end{equation}
The estimator $\hat{\boldsymbol{\mu}}$ is therefore unbiased and its variance is
\begin{equation}
  \mbox{Var}[\hat{\boldsymbol{\mu}}]=\frac{\sigma^2}{N},
\end{equation}
where $\mbox{Var}[{\mathbf{z}}]=\sigma^2$.

*  Bias of $\hat{\boldsymbol{\sigma}}^2$:
\begin{equation}
 E_{\boldsymbol{D}_N}[\hat{\boldsymbol{\sigma}}^2]=\sigma^2.
\end{equation}
The estimator $\hat{\boldsymbol{\sigma}}^2$ is thus unbiased.

*  The quality of an estimator $\hat{\boldsymbol{\theta}}$ can be measured using the *mean square error*

\begin{equation}
 \mbox{MSE}={E}_{\boldsymbol{D}_N}[(\theta - \hat{\boldsymbol{\theta}})^2].
\end{equation}
We can show that, for any estimator $\hat{\boldsymbol{\theta}}$, the MSE is the sum of the variance and the squared bias:
\begin{equation}
 \mbox{MSE}=\mbox{Var}[\hat{\boldsymbol{\theta}}]+({E}[\hat{\boldsymbol{\theta}}]-\theta)^2.
\end{equation}
*  Let $\hat{F}_z(x)=\frac{1}{N}\sum_{i=1}^N \mathbb{1}_{z_i\le x}$ be the empirical distribution function. We have
\begin{equation}
 {E}_{\boldsymbol{D}_N}[\hat{\bf F}_z(x)]=F_z(x),
\end{equation}
where $F_z(x)$ is the distribution function of the variable $\boldsymbol{z}$.

*  Let $N$ observations be drawn from a normal distribution with mean $\mu$ and standard deviation $\sigma$. The estimator $\hat{\boldsymbol{\mu}}$ of the mean follows a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{N}$. It follows that a confidence interval for $\mu$ is given by

\begin{equation}
 \mbox{Prob}\left\{ \hat{\boldsymbol{\mu}}-z_{\alpha/2}\frac{\sigma}{\sqrt{N}} \le \mu\le \hat{\boldsymbol{\mu}}+z_{\alpha/2}\frac{\sigma}{\sqrt{N}}\right\}=1-\alpha,
\end{equation}

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution and $1-\alpha$ is the probability $P$ that the interval contains $\mu$.
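
As a small illustration of the notions above, the following R sketch computes the sample mean, the sample variance and the confidence interval for one simulated sample set. The values of `mu`, `sigma`, `N` and `alpha` are illustrative assumptions (they do not come from the text), and $\sigma$ is assumed to be known, as in the interval formula.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
mu    <- 0                       # illustrative true mean (assumption)
sigma <- 10                      # illustrative true standard deviation (assumption)
N     <- 50                      # illustrative sample size (assumption)
alpha <- 0.05                    # gives a 95% confidence level

D_N <- rnorm(N, mean = mu, sd = sigma)   # one sample set D_N

mu_hat     <- mean(D_N)          # sample mean
sigma2_hat <- var(D_N)           # sample variance; var() divides by N - 1

z_half <- qnorm(1 - alpha / 2)   # upper alpha/2 quantile of N(0, 1)
ci <- c(lower = mu_hat - z_half * sigma / sqrt(N),
        upper = mu_hat + z_half * sigma / sqrt(N))

print(mu_hat)
print(sigma2_hat)
print(ci)
```

Rerunning the script with another seed gives a different interval: the interval is random, while $\mu$ is fixed.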



## Practical experiments 

The R programs are written in files with the extension '.R', which can be edited using text editors such as emacs, gedit, etc. The file can be loaded in the R terminal with the command

```
source("filename.R")
```

The additional parameter *print.eval=TRUE* forces all outputs of the script to be displayed on the screen:
```
source("filename.R", print.eval=TRUE)
```

You can change to the directory containing the scripts with the command
```
setwd("directory containing the scripts")
```
The goal of this TP is to write scripts for the following exercises.

You can use the scripts *cumdis.R, cumdis_2.R, sam_dis.R, sam_dis2.R, sam_dis_unif.R, mse_bv.R, combine.R* and *confidence.R* to help you with the exercises. The scripts are available on the homepage of the course: https://github.com/gbonte/gbcode/tree/master/inst/scripts.

## Distribution function 

Write a script that displays the empirical cumulative distribution function of a distribution $\mathcal{N}(1,2)$ with 100 observations. Use the functions *ecdf* and *rnorm*. See [cumdis.R](https://github.com/gbonte/gbcode/tree/master/inst/scripts/cumdis.R).
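
One possible starting point is sketched below. It is not the content of cumdis.R. It assumes that $\mathcal{N}(1,2)$ denotes a mean of 1 and a variance of 2, so `sd = sqrt(2)`; if the second parameter is meant as a standard deviation, use `sd = 2` instead.

```r
set.seed(1)                 # arbitrary seed, for reproducibility only
N     <- 100                # number of observations
mu    <- 1
sigma <- sqrt(2)            # ASSUMPTION: N(1,2) read as mean 1, variance 2

D_N   <- rnorm(N, mean = mu, sd = sigma)
F_hat <- ecdf(D_N)          # ecdf() returns a step function

# Empirical cdf, with the theoretical cdf overlaid in red for comparison
plot(F_hat, main = "Empirical cdf, N = 100", xlab = "x", ylab = "F(x)")
curve(pnorm(x, mean = mu, sd = sigma), add = TRUE, col = "red")
```

Because `F_hat` is an ordinary R function, it can be evaluated at any point, for instance `F_hat(1)`.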



## Expected value of the empirical distribution function

Write a script which verifies the assertion ${E}_{\mathbf{D}_N}[\hat{\bf F}_z(x)]=F_z(x)$ concerning the cumulative empirical distribution function. Modify the previous code in order to

* generate $R$ samples of 100 observations
* average the $R$ empirical cdfs
* plot the averaged empirical distribution function and compare it with the theoretical distribution function.
* observe the results for $R\in \{5,10,50,100\}.$

See [cumdis_2.R](https://github.com/gbonte/gbcode/tree/master/inst/scripts/cumdis_2.R).
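
The steps above can be sketched as follows. This is an illustration, not the content of cumdis_2.R. The helper `avg_ecdf` and the evaluation grid `x_grid` are hypothetical names introduced here, and $\mathcal{N}(1,2)$ is again assumed to mean variance 2.

```r
set.seed(1)                            # arbitrary seed, for reproducibility only
N     <- 100
mu    <- 1
sigma <- sqrt(2)                       # ASSUMPTION: N(1,2) read as mean 1, variance 2
x_grid <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 200)

# Hypothetical helper: average R empirical cdfs evaluated on x_grid
avg_ecdf <- function(R) {
  # each column holds one empirical cdf evaluated on the grid
  F_mat <- replicate(R, ecdf(rnorm(N, mean = mu, sd = sigma))(x_grid))
  rowMeans(F_mat)
}

F_true <- pnorm(x_grid, mean = mu, sd = sigma)
for (R in c(5, 10, 50, 100)) {
  F_bar <- avg_ecdf(R)
  cat("R =", R, " max |F_bar - F| =", max(abs(F_bar - F_true)), "\n")
}

# Plot the last average (R = 100) against the theoretical cdf (red)
plot(x_grid, F_bar, type = "s", xlab = "x", ylab = "F(x)",
     main = "Average of R = 100 empirical cdfs")
lines(x_grid, F_true, col = "red")
```

The printed maximum gap is a simple numerical summary of how close the averaged empirical cdf is to $F_z(x)$ for each $R$.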

## Estimator of the mean

Write a script which returns 1000 estimates of the mean, each computed from $N$ observations following a normal distribution $\mathcal{N}(0,100)$. Use $N=50$, $N=75$ and $N=100$. Plot the histogram of these estimates and compare it with the theoretical distribution of the estimator of the mean.

Make use of the script [sam_dis.R](https://github.com/gbonte/gbcode/tree/master/inst/scripts/sam_dis.R), which shows in practice how the estimator $\hat{\boldsymbol{\mu}}$ of the mean is distributed when the data are normally distributed.

Observe that it is unbiased and that its distribution is $\mathcal{N}(\mu, \sigma^2/N)$ (have a look at the observed variances and at the shape of the histograms).
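
A minimal sketch of this experiment is given below; it is not the content of sam_dis.R. It assumes that $\mathcal{N}(0,100)$ denotes a variance of 100, hence `sigma <- 10`. The name `n_rep` is introduced here for the number of estimates.

```r
set.seed(1)               # arbitrary seed, for reproducibility only
mu    <- 0
sigma <- 10               # ASSUMPTION: N(0,100) read as mean 0, variance 100
n_rep <- 1000             # number of estimates of the mean

op <- par(mfrow = c(1, 3))
for (N in c(50, 75, 100)) {
  # n_rep independent realisations of the sample mean
  mu_hat <- replicate(n_rep, mean(rnorm(N, mean = mu, sd = sigma)))
  cat("N =", N,
      " mean of estimates =", mean(mu_hat),
      " observed variance =", var(mu_hat),
      " theoretical variance =", sigma^2 / N, "\n")
  hist(mu_hat, freq = FALSE, main = paste("N =", N), xlab = "estimate of the mean")
  # theoretical density N(mu, sigma^2 / N) in red
  curve(dnorm(x, mean = mu, sd = sigma / sqrt(N)), add = TRUE, col = "red")
}
par(op)
```

The observed variance should shrink roughly as $1/N$, and the histograms should stay centred on $\mu$.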



## Estimator of the variance

Proceed in the same way, this time considering the estimator of the variance. We want to verify that $\frac{(N-1)\hat{\boldsymbol{\sigma}}^2}{\sigma^2}\sim \chi^2_{N-1}$. Use $N=10$.
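
A possible sketch follows. It assumes the same setting as the previous exercise (normal data with mean 0 and variance 100, 1000 estimates); these choices are assumptions, since the exercise only fixes $N=10$.

```r
set.seed(1)               # arbitrary seed, for reproducibility only
mu    <- 0                # ASSUMPTION: same distribution as the previous exercise
sigma <- 10
N     <- 10
n_rep <- 1000             # ASSUMPTION: same number of estimates as before

# n_rep realisations of the sample variance (var() divides by N - 1)
sigma2_hat <- replicate(n_rep, var(rnorm(N, mean = mu, sd = sigma)))
stat <- (N - 1) * sigma2_hat / sigma^2

# A chi-squared variable with N - 1 degrees of freedom has mean N - 1
# and variance 2 (N - 1)
cat("mean of statistic =", mean(stat), " expected =", N - 1, "\n")
cat("variance of statistic =", var(stat), " expected =", 2 * (N - 1), "\n")

hist(stat, freq = FALSE, breaks = 30,
     main = "(N-1) * sample variance / sigma^2", xlab = "value")
curve(dchisq(x, df = N - 1), add = TRUE, col = "red")
```

The average of `sigma2_hat` can also be compared with $\sigma^2$ to check that the estimator is unbiased.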



## Bias and variance

Write a script which verifies the equation

\begin{equation}
 \mbox{MSE}=\mbox{Var}[\hat{\boldsymbol{\theta}}]+({E}[\hat{\boldsymbol{\theta}}]-\theta)^2.
\end{equation}

Take as an example the estimator of the mean of 10 observations following the distribution $\mathcal{N}(0,100)$, and generate 10000 estimates.
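
The check can be sketched as follows (an illustration, not the content of mse_bv.R). It assumes that $\mathcal{N}(0,100)$ denotes a variance of 100. The empirical variance is computed here with a division by the number of estimates rather than by that number minus one, so that the empirical decomposition holds up to rounding error.

```r
set.seed(1)               # arbitrary seed, for reproducibility only
mu    <- 0
sigma <- 10               # ASSUMPTION: N(0,100) read as mean 0, variance 100
N     <- 10
n_rep <- 10000            # number of estimates

mu_hat <- replicate(n_rep, mean(rnorm(N, mean = mu, sd = sigma)))

mse      <- mean((mu_hat - mu)^2)            # empirical mean square error
variance <- mean((mu_hat - mean(mu_hat))^2)  # empirical variance (divides by n_rep)
bias     <- mean(mu_hat) - mu                # empirical bias

cat("MSE               =", mse, "\n")
cat("Variance + Bias^2 =", variance + bias^2, "\n")
cat("Theoretical MSE   =", sigma^2 / N, "\n")
```

Since $\hat{\boldsymbol{\mu}}$ is unbiased, the bias term should be close to zero and the MSE close to $\sigma^2/N$.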



## Mean of estimators

The mean of two independent unbiased estimators having the same variance is itself unbiased, and its variance is half that of the estimators it is derived from (see slide 30 in the file http://www.ulb.ac.be/di/map/gbonte/mod_stoch/nonlin.pdf). Write a script which illustrates this by

* independently generating two distributions of the estimator of the mean for a uniform distribution taking values between -10 and 10,
* displaying the histograms of the two distributions and of their combination, and computing their variances.
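
These two steps can be sketched as below (an illustration, not the content of combine.R). The sample size `N` and the number of estimates `n_rep` are not given in the exercise, so the values used here are hypothetical.

```r
set.seed(1)               # arbitrary seed, for reproducibility only
N     <- 20               # HYPOTHETICAL sample size (not specified in the exercise)
n_rep <- 10000            # HYPOTHETICAL number of estimates

# Two independent sets of estimates of the mean of U(-10, 10)
mu_hat_1 <- replicate(n_rep, mean(runif(N, min = -10, max = 10)))
mu_hat_2 <- replicate(n_rep, mean(runif(N, min = -10, max = 10)))
mu_hat_c <- (mu_hat_1 + mu_hat_2) / 2     # combined estimator

cat("Var estimator 1 =", var(mu_hat_1), "\n")
cat("Var estimator 2 =", var(mu_hat_2), "\n")
cat("Var combination =", var(mu_hat_c), "\n")
# Var of U(-10, 10) is 20^2 / 12, so each sample mean has variance (20^2 / 12) / N
cat("Theoretical     =", (20^2 / 12) / N, "and", (20^2 / 12) / (2 * N), "\n")

op <- par(mfrow = c(1, 3))
hist(mu_hat_1, freq = FALSE, main = "Estimator 1", xlab = "estimate")
hist(mu_hat_2, freq = FALSE, main = "Estimator 2", xlab = "estimate")
hist(mu_hat_c, freq = FALSE, main = "Combination", xlab = "estimate")
par(op)
```

Using the same x-axis limits for the three histograms (argument `xlim`) makes the narrower spread of the combination easier to see.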




## Confidence intervals

Write a script which generates $N$ samples from the distribution $\mathcal{N}(1,5)$ and returns the percentage of values not falling into the $P\%$ confidence interval. Test with $P=95\%$, using $N=100$ and $N=1000$.
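
One way to read this exercise is sketched below; it is not the content of confidence.R. The sketch makes several assumptions: $\mathcal{N}(1,5)$ is read as mean 1 and variance 5, $\sigma$ is treated as known, $N$ is the size of each sample, and the experiment is repeated `n_rep` times (a hypothetical value) to count how often the interval misses $\mu$. The helper `miss_rate` is a name introduced here.

```r
set.seed(1)               # arbitrary seed, for reproducibility only
mu    <- 1
sigma <- sqrt(5)          # ASSUMPTION: N(1,5) read as mean 1, variance 5
P     <- 0.95             # confidence level
alpha <- 1 - P
z_half <- qnorm(1 - alpha / 2)   # upper alpha/2 quantile of N(0, 1)
n_rep <- 10000            # HYPOTHETICAL number of repetitions

# Hypothetical helper: percentage of intervals that do not contain mu
miss_rate <- function(N) {
  misses <- replicate(n_rep, {
    mu_hat     <- mean(rnorm(N, mean = mu, sd = sigma))
    half_width <- z_half * sigma / sqrt(N)
    mu < mu_hat - half_width || mu > mu_hat + half_width
  })
  100 * mean(misses)
}

sizes <- c(100, 1000)
rates <- sapply(sizes, miss_rate)
for (k in seq_along(sizes)) {
  cat("N =", sizes[k], " intervals missing mu:", rates[k], "%\n")
}
```

Under these assumptions the percentage should be close to $100\,\alpha = 5\%$ for both values of $N$; a larger $N$ makes the interval narrower but does not change its coverage.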

