### Week 5 - Statistical Inference

# Statistical Inference

## Introduction

* We often want to understand characteristics of a **population** based on a **sample** drawn from it. This process is known as **statistical inference**.
* Recall from previous discussions (Slides 2 & 5) that population parameters (like $\mu$ for mean and $\sigma^2$ for variance) are often unknown.
* Statistical inference provides tools to estimate these unknown parameters and make decisions about the population based on sample data.

## Parameter Estimation (Slide 4)

* The core of statistical inference often involves **parameter estimation**.
* We use **estimators** (functions of the sample data) to estimate the unknown population parameters.

### Example (Slide 2 & 5)

* Consider the example of weights of individuals. Initially, we might assume a distribution with known parameters: $Weight \sim \mathcal{N}(\theta \equiv \{\mu = 68, \sigma = 16\})$.
* However, in reality, $\mu$ and $\sigma$ are likely unknown.
* We collect sample data. The same reasoning applies to any measured quantity, e.g., heights of people in a classroom: $y = (y_1, y_2, \ldots, y_n) = (1.75, 1.64, 1.81, 1.55, 1.51, 1.67, 1.83, 1.63, 1.72, \ldots)$.
* We can use this sample data to estimate the population mean and variance of heights.
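
As a rough illustration of estimating parameters from a sample, the sketch below computes the sample mean and sample variance from only the nine heights listed above. The elided part of the sample is not filled in, so the numbers are illustrative only, and the variable names are assumptions made for this example.

```python
import statistics

# Only the nine heights listed in the notes; the elided remainder
# of the sample is deliberately left out.
heights = [1.75, 1.64, 1.81, 1.55, 1.51, 1.67, 1.83, 1.63, 1.72]

n = len(heights)
mean_hat = sum(heights) / n              # sample mean, an estimate of mu
var_hat = statistics.variance(heights)   # sample variance (divisor n - 1), an estimate of sigma^2

print(f"n = {n}, mean = {mean_hat:.3f}, variance = {var_hat:.4f}")
```

With more data the same two lines of arithmetic apply unchanged; only the list grows.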

## Bias and Estimator Quality (Slide 3)

* When we use an estimator, we want it to be "good". Key properties of an estimator include:

    * **Bias**: The difference between the expected value of the estimator and the true value of the parameter.
        * $Bias(\hat{\theta}) = E[\hat{\theta}] - \theta$
        * An estimator is **unbiased** if $E[\hat{\theta}] = \theta$, meaning its average value over many samples equals the true parameter.

    * **Variance**: A measure of the estimator's variability or spread around its expected value.
        * $Var(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$
        * Lower variance indicates more precise estimates.

    * **Mean Squared Error (MSE)**: The expected squared error of the estimator, a measure that combines both bias and variance.
        * $MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = Var(\hat{\theta}) + [Bias(\hat{\theta})]^2$
        * We often aim for estimators with low MSE.
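
These three quantities can be approximated by simulation. The sketch below is a minimal example under assumed settings (normal data with $\mu = 0$, $\sigma = 2$, samples of size 5, all hypothetical): it compares two estimators of $\sigma^2$, one dividing by $n$ and one dividing by $n - 1$, and checks that the empirical MSE equals variance plus squared bias.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

MU, SIGMA, N, REPS = 0.0, 2.0, 5, 20000   # hypothetical settings
TRUE_VAR = SIGMA ** 2

def mean(xs):
    return sum(xs) / len(xs)

def summarize(estimates, truth):
    """Return the empirical (bias, variance, mse) of a list of estimates."""
    centre = mean(estimates)
    bias = centre - truth
    var = mean([(e - centre) ** 2 for e in estimates])
    mse = mean([(e - truth) ** 2 for e in estimates])
    return bias, var, mse

biased, unbiased = [], []
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    y_bar = mean(sample)
    ss = sum((y - y_bar) ** 2 for y in sample)
    biased.append(ss / N)          # divisor n
    unbiased.append(ss / (N - 1))  # divisor n - 1

for name, est in [("divisor n", biased), ("divisor n-1", unbiased)]:
    bias, var, mse = summarize(est, TRUE_VAR)
    print(f"{name}: bias={bias:.3f} var={var:.3f} "
          f"mse={mse:.3f} var+bias^2={var + bias ** 2:.3f}")
```

In runs like this the divisor-$n$ estimator typically shows a clearly negative bias but a smaller variance, which is why MSE, rather than bias alone, is the usual yardstick.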

## Maximum Likelihood Estimation (MLE) (Slide 53)

* **Maximum Likelihood Estimation (MLE)** is a common method for estimating parameters.
* The idea is to find the parameter values that maximize the **likelihood function**: the probability (or probability density, for continuous data) of the observed sample, viewed as a function of the parameter values.

    * Given a sample $y = (y_1, ..., y_n)$ from a distribution with parameter(s) $\theta$, the likelihood function is:
        $$L(\theta | y) = P(y_1, ..., y_n | \theta) = \prod_{i=1}^{n} P(y_i | \theta) \quad \text{(for i.i.d. samples)}$$

* We find the value of $\theta$ that maximizes $L(\theta | y)$. Often, it's easier to maximize the **log-likelihood function**:
    $$l(\theta | y) = \log L(\theta | y) = \sum_{i=1}^{n} \log P(y_i | \theta)$$

* **Consistency of MLE**: For many distributions, the MLE is **consistent**, meaning that as the sample size $n$ increases, the estimator $\hat{\theta}_{MLE}$ converges in probability to the true parameter $\theta$ ($\hat{\theta}_{MLE} \xrightarrow{p} \theta$). However, there are cases (e.g., high-dimensional problems) where this might not hold.
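
To make the log-likelihood concrete, here is a minimal sketch for normally distributed data. It assumes $\sigma$ is known (an assumption made purely to keep the example one-dimensional) and uses simulated data with made-up parameters. A crude grid search over $\mu$ should land next to the sample mean, which is the closed-form MLE of $\mu$ for the normal distribution.

```python
import math
import random

random.seed(1)

SIGMA = 16.0   # assumed known for this sketch
sample = [random.gauss(68.0, SIGMA) for _ in range(200)]   # simulated data

def log_likelihood(mu, ys, sigma=SIGMA):
    """Log-likelihood of i.i.d. N(mu, sigma^2) observations."""
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const - (y - mu) ** 2 / (2 * sigma ** 2) for y in ys)

# Crude grid search over candidate means from 40.0 to 100.0 in steps of 0.05.
grid = [40 + 0.05 * k for k in range(1201)]
mu_grid = max(grid, key=lambda mu: log_likelihood(mu, sample))
mu_closed_form = sum(sample) / len(sample)   # sample mean

print(f"grid maximiser: {mu_grid:.2f}, sample mean: {mu_closed_form:.2f}")
```

A grid search is used here only because it is easy to read; in practice one would differentiate the log-likelihood or use a numerical optimiser.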

## Sampling Distribution (Slides 54 & 55)

* The **sampling distribution** of an estimator is the probability distribution of that estimator when computed from multiple independent random samples of the same size from the same population.

### Example (Slide 54)

* Consider a random sample $Y_1, ..., Y_n$ from a normal distribution $N(\mu, \sigma^2)$.
* The **sample mean** $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$ is an estimator of the population mean $\mu$.
* The sampling distribution of the sample mean has the following mean and variance:
    * $E[\bar{Y}] = \mu$ (unbiased estimator)
    * $Var(\bar{Y}) = \frac{\sigma^2}{n}$
* Furthermore, if the population is normally distributed, the sampling distribution of the sample mean is also normal:
    * If $Y_i \sim N(\mu, \sigma^2)$, then $\bar{Y} \sim N(\mu, \frac{\sigma^2}{n})$.
* Knowing the sampling distribution provides more information than just the mean and variance of the estimator; it tells us the entire probabilistic behavior of the estimator.
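
The sampling distribution can be approximated by drawing many samples and recording the sample mean of each. The sketch below does this under assumed values ($\mu = 68$, $\sigma = 16$, $n = 25$, with 10,000 repetitions chosen arbitrarily) and compares the empirical mean and variance of $\bar{Y}$ with $\mu$ and $\sigma^2/n$.

```python
import random

random.seed(2)

MU, SIGMA, N, REPS = 68.0, 16.0, 25, 10000   # illustrative values

def sample_mean(n):
    """Draw one sample of size n from N(MU, SIGMA^2) and return its mean."""
    return sum(random.gauss(MU, SIGMA) for _ in range(n)) / n

means = [sample_mean(N) for _ in range(REPS)]

avg_of_means = sum(means) / REPS
var_of_means = sum((m - avg_of_means) ** 2 for m in means) / REPS

print(f"average of sample means:  {avg_of_means:.2f} (theory: {MU})")
print(f"variance of sample means: {var_of_means:.2f} (theory: {SIGMA ** 2 / N:.2f})")
```

A histogram of `means` would show the bell shape predicted by $\bar{Y} \sim N(\mu, \sigma^2/n)$.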

## Reading/Terms to Revise (Slide 55)

* **Estimator, parameter estimation**
* **Sample mean**
* **Maximum likelihood**
* **Sampling distribution**
* **Bias, variance, and squared error of an estimator**

## Next Steps (Slide 55)

* Next week, we will cover the **Central Limit Theorem** and **confidence intervals**, which build upon the concepts of sampling distributions and parameter estimators to quantify the uncertainty in our estimates.
* **Reading for next week:** Chapters 6 (Section 6.3) and 7 (primarily Sections 7.3, 7.4, also 7.5) of Ross.

Sum of squared errors  
To estimate the mean, minimise $S(μ) = \sum_{i=1}^{n}(y_i - μ)^2$ with respect to $μ$  
Set the derivative to zero (the turning point, here a minimum): $\frac{dS}{dμ} = -2\sum_{i=1}^{n}(y_i - μ) = 0$  
Solving gives $\hat{μ} = \frac{1}{n}\sum_{i=1}^{n} y_i$, which is just the sample mean  
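
A quick numerical check of this result, as a sketch only: using the first five heights from the earlier example as illustrative data, the sum of squared errors at the sample mean should be smaller than at any shifted candidate value.

```python
ys = [1.75, 1.64, 1.81, 1.55, 1.51]   # a few illustrative values

def sse(mu, data):
    """Sum of squared errors between the data and a candidate mean."""
    return sum((y - mu) ** 2 for y in data)

y_bar = sum(ys) / len(ys)

# The sample mean should give a smaller SSE than any shifted candidate.
for shift in (-0.1, -0.01, 0.01, 0.1):
    assert sse(y_bar, ys) < sse(y_bar + shift, ys)

print(f"sample mean = {y_bar:.3f}, SSE at sample mean = {sse(y_bar, ys):.4f}")
```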


Maximum likelihood estimation  
$p(y | θ) = \prod_{i=1}^{n} p(y_i | θ)$  
i.e. for an i.i.d. sample, the likelihood is the product of the probabilities (or densities) of the individual observations  

Because a product of many probabilities is typically extremely small, and because sums are easier to differentiate than products,
we usually take the logarithm of the likelihood before computing the derivative  
(this uses the properties $\ln(ab) = \ln a + \ln b$, $\ln(1/a) = -\ln a$ and $\ln(a^b) = b \ln a$)  
(don't forget the exponent rule: $e^a e^b = e^{a+b}$)
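
The numerical side of this can be seen in a toy example. The sketch below assumes 2000 observations that each have probability 0.5 (made-up values): multiplying the probabilities directly underflows to zero in double precision, while summing their logarithms stays perfectly representable.

```python
import math

theta = 0.5
n = 2000
probs = [theta] * n    # probability of each of the n observations (toy example)

likelihood = 1.0
for p in probs:
    likelihood *= p    # the running product eventually underflows to 0.0

log_likelihood = sum(math.log(p) for p in probs)

print("product of probabilities:", likelihood)
print("sum of log-probabilities:", round(log_likelihood, 2))
```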

(alternatives: Maximum a Posteriori (MAP), Bayesian Estimate, out of scope)

MLE will be in the assignment (not the mid-term).  
The mid-term will cover weeks 1-4.

### Tutorial


#### 2. ML estimator of the Bernoulli Distribution
In the Bernoulli distribution, for $y \in \{0, 1\}$:  
$p(y|θ) = θ^y(1-θ)^{1-y}$  


2.1 Likelihood  
What is  
$p(y|θ) = \prod_{i=1}^{n} p(y_i|θ)$  
equal to?  

$p(y|θ) = p(y_1|θ) \cdot p(y_2|θ) \cdots p(y_n|θ)$  
        = $θ^{y_1}(1-θ)^{1-y_1} \cdot θ^{y_2}(1-θ)^{1-y_2} \cdots θ^{y_n}(1-θ)^{1-y_n}$  
        = $θ^{\sum_{i=1}^{n} y_i} (1-θ)^{\sum_{i=1}^{n} (1-y_i)}$  
  

Let $m = \sum_{i=1}^{n} y_i$ (the number of successes), so that $\sum_{i=1}^{n} (1-y_i) = n - m$:  
    $p(y|θ) = θ^m(1-θ)^{n-m}$  



  
2.2 Negative log-likelihood  
$p(y|θ) = θ^m(1-θ)^{n-m}$  
$L(y|θ) = -\log[p(y|θ)]$  
        = $-\log θ^m - \log (1-θ)^{n-m}$  
        = $-m \log θ - (n-m) \log (1-θ)$  



2.3 Find the MLE  
$\frac{d[L(y|θ)]}{dθ}$  
    = $-\frac{m}{θ} - (n-m) \cdot \frac{1}{1-θ} \cdot (-1)$  
    = $-\frac{m}{θ} + \frac{n-m}{1-θ}$  

Let $\frac{d[L(y|θ)]}{dθ} = 0$:  
$-\frac{m}{θ} + \frac{n-m}{1-θ} = 0$  
$\frac{-m(1-θ) + (n-m)θ}{(1-θ)θ} = 0$  
$-m + mθ + nθ - mθ = 0$  
$-m + nθ = 0$  
$\hat{θ} = m/n$, the ML estimator of the parameter θ,  
where n is the number of trials and m is the number of successes  
the formula holds for any n, but the estimate is only reliable when n is large enough (its variance is $θ(1-θ)/n$)  
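
The closed-form result can be checked numerically. This is a minimal sketch with made-up coin flips: it evaluates the negative log-likelihood $-m \log θ - (n-m) \log(1-θ)$ on a grid of candidate values and compares the minimiser with $m/n$.

```python
import math

y = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]   # made-up Bernoulli data (1 = success)
n, m = len(y), sum(y)

def neg_log_likelihood(theta):
    """Negative log-likelihood of theta given m successes in n trials."""
    return -m * math.log(theta) - (n - m) * math.log(1 - theta)

theta_closed_form = m / n

# Grid over the open interval (0, 1); the endpoints are excluded
# because log(0) is undefined.
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = min(grid, key=neg_log_likelihood)

print(f"closed form: {theta_closed_form}, grid search: {theta_grid}")
```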

#### 3. Hitchhiker's estimator
The hitchhiker's estimator always says the answer is 42, no matter what the data are: $\hat{μ}(y) = 42$  

3.1 Quality of the hitchhiker's estimator  
Bias = expected value of the estimator - true value of the parameter  
1. Bias of the hitchhiker's estimator:  
$B_μ(\hat{μ}) = E[\hat{μ}(Y)] - μ = 42 - μ$  

2. Variance of the hitchhiker's estimator:  
$Var(\hat{μ}) = E[(\hat{μ}(Y) - E[\hat{μ}(Y)])^2] = E[(42 - 42)^2] = 0$  

3. Squared error of the hitchhiker's estimator:  
$MSE = E[(\hat{μ}(Y) - μ)^2] = (42 - μ)^2$  

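3.3 When does the hitchhiker's estimator work well?  
When is it better than the sample mean?  
The sample mean is unbiased with variance $σ^2/n$, so its MSE is $σ^2/n$. The hitchhiker's estimator is better when its MSE is smaller: $(42 - μ)^2 < σ^2/n$.  
The boundary cases satisfy $(42 - μ)^2 = σ^2/n$:  
$42 - μ = \pm\sqrt{\frac{σ^2}{n}} = \pm\frac{σ}{\sqrt{n}}$  
$42 - μ = -\frac{σ}{\sqrt{n}}$      |      $42 - μ = \frac{σ}{\sqrt{n}}$  
$μ_2 = 42 + \frac{σ}{\sqrt{n}}$      |       $μ_1 = 42 - \frac{σ}{\sqrt{n}}$  
So the hitchhiker's estimator beats the sample mean only when $μ_1 < μ < μ_2$, i.e. when the true mean lies within $σ/\sqrt{n}$ of 42.  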
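
The comparison can also be sketched numerically by evaluating both mean squared errors: $(42 - μ)^2$ for the hitchhiker's estimator and $σ^2/n$ for the sample mean. The values $σ = 16$ and $n = 25$ below are illustrative assumptions, and the function names are hypothetical.

```python
import math

def mse_hitchhiker(mu):
    """MSE of the constant estimator 42: zero variance, bias 42 - mu."""
    return (42 - mu) ** 2

def mse_sample_mean(sigma, n):
    """MSE of the sample mean: unbiased, variance sigma^2 / n."""
    return sigma ** 2 / n

def hitchhiker_wins(mu, sigma, n):
    """True when the constant estimator has the smaller MSE."""
    return mse_hitchhiker(mu) < mse_sample_mean(sigma, n)

sigma, n = 16.0, 25                 # illustrative values; sigma / sqrt(n) = 3.2
mu_1 = 42 - sigma / math.sqrt(n)    # lower boundary
mu_2 = 42 + sigma / math.sqrt(n)    # upper boundary
print(f"hitchhiker wins for {mu_1:.1f} < mu < {mu_2:.1f}")

for mu in (38.0, 40.0, 42.0, 45.0, 68.0):
    print(mu, hitchhiker_wins(mu, sigma, n))
```

The winning interval shrinks as $n$ grows, so with enough data the sample mean is better for every true mean other than exactly 42.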