# EBA3500 Lecture 11. Expectation, consistency, and the adjusted $R^2$

## Review of expectation and variance


### Expectation
Recall the expectation $E(X)$, which equals the theoretical mean of a random variable.

#### Definition (Expectation)
> Let $X$ be a discrete random variable with probability mass function $p(x) = P(X=x)$. Then $E(X) = \sum xp(x)$ is the *expectation* or *expected value* of $X$.

**Note:** If $X$ isn't discrete, which is usually the case, we have to use an integral instead of the sum, i.e., $E(X) = \int xp(x)dx$, where $p(x)$ is now the probability density function. There is *no difference* in interpretation though. The expectation is in both cases the theoretical mean of $X$.

#### Example (Bernoulli)
Let $X$ be Bernoulli distributed with success probability $\pi$, i.e., $P(X=1) = \pi$ and $P(X=0) = 1 - \pi$. Then 
$$P(X=1)\cdot1 + P(X=0)\cdot 0 = \pi\cdot1 + (1-\pi) \cdot 0 = \pi,$$
hence $E(X) = \pi$.
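
The definition translates almost word for word into code. The sketch below is only an illustration: the helper name `expectation` and the value `pi = 0.3` are assumptions made for the example.

```python
import numpy as np

def expectation(values, probabilities):
    """Expectation of a discrete random variable: the sum of x * p(x)."""
    values = np.asarray(values, dtype=float)
    probabilities = np.asarray(probabilities, dtype=float)
    return float(np.sum(values * probabilities))

pi = 0.3  # hypothetical success probability, chosen for illustration
bernoulli_mean = expectation([0, 1], [1 - pi, pi])  # the Bernoulli case: pi
die_mean = expectation(range(1, 7), [1 / 6] * 6)    # a fair die: 3.5 up to rounding
print(bernoulli_mean, die_mean)
```

The same function works for any discrete distribution with finitely many values, as long as the probabilities sum to one.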

The expectation operator has a momentous property: it is linear. Linearity often allows us to calculate expectations without going back to the definition given above.

#### Proposition: Linearity of expectation
> Let $X_1, X_2$ be random variables and $a, b$ be numbers. Then
$E(aX_1 + bX_2) = aEX_1+bEX_2$.

In general, a function $f$ is linear if it acts like this, i.e., $f(ax+by) = af(x) + bf(y)$, where $a,b$ are numbers and $x,y$ are some kind of mathematical objects. For instance, matrix multiplication is linear: $A(ax+by) = aAx + bAy$.

#### Example (Bernoulli)
Suppose $X_1$ and $X_2$ are Bernoulli variables with parameters $\pi_1$ and $\pi_2$. What is the expectation of $X_1 + X_2$? 
Using the linearity of expectation, we find that $E(X_1 + X_2) = E(X_1) + E(X_2) = \pi_1 + \pi_2$! 
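
Linearity is easy to check by simulation. The following is a minimal sketch, assuming the hypothetical parameters `pi_1 = 0.2` and `pi_2 = 0.7` and a sample size picked for convenience. The two variables are drawn independently here, but linearity does not require independence.

```python
import numpy as np

rng = np.random.default_rng(seed=313)
pi_1, pi_2 = 0.2, 0.7  # hypothetical success probabilities
n = 100_000            # number of simulated draws

x_1 = rng.binomial(1, pi_1, size=n)  # Bernoulli(pi_1) draws
x_2 = rng.binomial(1, pi_2, size=n)  # Bernoulli(pi_2) draws

# E(X_1 + X_2) should be close to pi_1 + pi_2
sum_simulated = np.mean(x_1 + x_2)
sum_theoretical = pi_1 + pi_2

# E(2 X_1 - 3 X_2) should be close to 2 pi_1 - 3 pi_2
combo_simulated = np.mean(2 * x_1 - 3 * x_2)
combo_theoretical = 2 * pi_1 - 3 * pi_2

print(sum_simulated, sum_theoretical)
print(combo_simulated, combo_theoretical)
```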

### Variance
The variance of a random variable $X$ captures its dispersion, or how spread out it is. 

#### Definition (Variance)
> Let $X$ be a discrete random variable. Then $$\textrm{Var}(X) = E(X^2) - E(X)^2$$ is the *variance* of $X$.

#### Example (Bernoulli)
Let $X$ be Bernoulli distributed with success probability $\pi$, i.e., $P(X=1) = \pi$ and $P(X=0) = 1 - \pi$. Then $E(X) = \pi$, so we can calculate the $E(X)^2 = \pi^2$ part of the variance equation. To calculate the $E(X^2)$ part, use
$$E(X^2) = P(X=1)\cdot1^2 + P(X=0)\cdot 0^2 = \pi\cdot1^2+ (1-\pi) \cdot 0^2 = \pi.$$
Hence $\textrm{Var}(X) = \pi - \pi^2 = \pi(1-\pi)$.
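
The calculation can be reproduced numerically. This is a small sketch under the assumption `pi = 0.3` (an arbitrary choice): it computes $E(X^2) - E(X)^2$ from the probability mass function and compares it with the variance of simulated Bernoulli draws.

```python
import numpy as np

pi = 0.3  # hypothetical success probability
values = np.array([0.0, 1.0])
probs = np.array([1 - pi, pi])

e_x = np.sum(values * probs)        # E(X)
e_x2 = np.sum(values**2 * probs)    # E(X^2)
variance = e_x2 - e_x**2            # Var(X) = E(X^2) - E(X)^2

rng = np.random.default_rng(seed=313)
draws = rng.binomial(1, pi, size=100_000)
simulated_variance = np.var(draws)  # should be close to pi * (1 - pi)

print(variance, pi * (1 - pi), simulated_variance)
```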

### Estimator
Let $\theta$ be some population value, e.g., an expectation $E(X)$ of some sort, or a regression coefficient. This value is typically unknown. An estimator $\hat{\theta}_n$ is a statistical measurement, based on observed data, of this population value. Whenever we say a population value, think about input to the data-generating process: These are the exact values the creator of a simulation study decides on.

#### Definition (Estimator) ([Wikipedia source](https://en.wikipedia.org/wiki/Estimator))
> In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished. For example, the sample mean is a commonly used estimator of the population mean.

#### Examples
1. Let $Y = \beta_0 + \beta_1 X + \epsilon$ and $\hat{\beta}_1$ be the regression coefficient estimated by least squares or least absolute deviations. Then $\hat{\beta}_1$ is an estimator of $\beta_1$, and $\beta_1$ is the estimand of $\hat{\beta}_1$.
2. Let $x = x_1, x_2, ..., x_n$ be some observed data. Then the mean (i.e., `np.mean(x)`) of $x$ is an estimator of the theoretical population mean $\mu$. 
3. Suppose we have a regression model with $p$ covariates and calculate its $R^2$. Then $R^2$ is an estimator of the true population $R^2$. 
4. Let, again, $x = x_1, x_2, ..., x_n$ be some observed data. Then the median (i.e., `np.median(x)`) of $x$ is an estimator of the theoretical population median. 
5. Is a *p*-value an estimator? No, as it does not attempt to measure a population value. 

#### Theoretical quantities
The theoretical quantities, such as the population mean, population median, and population $R^2$, are quite hard to understand. Many people never understand them at all, but you are different! 😊 They are properties of the data-generating process. Recall the intuition behind the data generating process. Alice, or Vishnu, or any other entity, has a program that generates your observed data. If you *knew* that program, you would have been able to calculate anything you'd ever want to. But you don't, and that's why you need to estimate.

```python
## This is Alice who simulates for Bob. Bob never sees the program.
import numpy as np
rng = np.random.default_rng(seed = 313)
u = rng.random(10) # uniformly distributed variables on [0,1]
x = -np.log(u)
x
```

```
array([0.4591469 , 0.767279  , 0.24938788, 0.90998112, 0.49108449,
       1.12724964, 2.24739775, 0.37583599, 4.02495832, 0.49735181])
```

The random variable $U$ is uniform, and has theoretical mean (expectation) $1/2$. Bob doesn't know what Alice is doing though, so he has to estimate the mean. The random variable $X$ also has an expectation, namely $1$. (If you know integral calculus, you can calculate this yourself!) 
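
If you would rather not integrate, a rough Monte Carlo check is possible. The sketch below reuses Alice's transformation with a much larger sample. The sample size is an assumption made for illustration, and the averages are only approximations of the expectations.

```python
import numpy as np

rng = np.random.default_rng(seed=313)
u = rng.random(1_000_000)  # uniform draws, as in Alice's program
x = -np.log(u)             # the same transformation Alice uses

mean_u = np.mean(u)  # should be close to E(U) = 1/2
mean_x = np.mean(x)  # should be close to E(X) = 1
print(mean_u, mean_x)
```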

#### Definition (Unbiased estimator)
> An estimator is *unbiased* if $E(\hat{\theta}_n) = \theta$.

Most popular estimators are not unbiased, and unbiasedness is *not* an important property in most scenarios. For instance, the estimated regression coefficients in a logistic regression are not unbiased. Neither are the estimated regression coefficients when using least absolute deviations. However, the sample variance $S^2 = \frac{1}{n-1}\sum (X_i - \overline{X})^2$ is unbiased. 
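
Unbiasedness is a statement about the average of the estimator over many data sets, so it can be explored by simulation. The following sketch assumes normal data with true variance $4$ and a small sample size `n = 5` (both arbitrary choices). It compares $S^2$, which divides by $n-1$, with the version that divides by $n$.

```python
import numpy as np

rng = np.random.default_rng(seed=313)
n, reps = 5, 200_000  # hypothetical sample size and number of repetitions
true_variance = 4.0

# Each row is one simulated data set of size n
samples = rng.normal(loc=0.0, scale=np.sqrt(true_variance), size=(reps, n))

s2_unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1
s2_biased = samples.var(axis=1, ddof=0)    # divides by n

mean_unbiased = s2_unbiased.mean()  # should be close to 4
mean_biased = s2_biased.mean()      # should be close to 4 * (n - 1) / n = 3.2
print(mean_unbiased, mean_biased)
```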

#### Definition (Convergence in probability)
> An estimator $\hat{\theta}_{n}$ converges in probability to $\theta$ if $P(|\hat{\theta}_{n}-\theta|>\epsilon)\to0$ for all $\epsilon>0$ as $n\to\infty$.

#### Definition (Consistency)
> An estimator $\hat{\theta}_n$ is *consistent* for $\theta$ if it converges in probability to $\theta$.

Consistency roughly means that the histogram of an estimator will concentrate arbitrarily well around the true value when $n\to\infty$.
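
The probability in the definition of convergence in probability can be approximated by repeating an experiment many times. The sketch below does this for the sample median of normal data. The values `mu = 0`, `epsilon = 0.1`, and the number of repetitions are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=313)
mu, epsilon, reps = 0.0, 0.1, 2000  # hypothetical settings

exceed_prob = {}
for n in (10, 100, 1000):
    # Each row is one simulated data set of size n
    samples = rng.normal(loc=mu, scale=1.0, size=(reps, n))
    medians = np.median(samples, axis=1)
    # Fraction of simulated medians further than epsilon from mu
    exceed_prob[n] = np.mean(np.abs(medians - mu) > epsilon)

print(exceed_prob)
```

If the estimator is consistent, the estimated probabilities should shrink towards $0$ as $n$ grows, for any fixed $\epsilon$.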
 

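```python
## Simulate from the normal distribution
import numpy as np
rng = np.random.default_rng(seed=313)
for n in (10, 1000, 100000):
    x = rng.normal(loc=1.0, scale=1.0, size=n)  # mu = 1 is an arbitrary choice
    ## Simulate the median and the mean; both should get closer to mu as n grows
    print(n, np.median(x), np.mean(x))
```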


It appears that the median and mean are consistent for the $\mu$ parameter in the normal distribution. This is, in fact, true. 

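The same does not hold for every distribution. Repeating the simulation with exponentially distributed data, we conclude, informally, that the sample median isn't consistent for the mean of the exponential distribution. Do you understand why? 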
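
A simulation along these lines could look as follows. It is a sketch, assuming an exponential distribution with mean $1$ (the same distribution as Alice's $X$) and sample sizes picked for convenience. The population median of this distribution is $\log 2 \approx 0.69$, which differs from the mean.

```python
import numpy as np

rng = np.random.default_rng(seed=313)

results = {}
for n in (100, 10_000, 1_000_000):
    x = rng.exponential(scale=1.0, size=n)  # population mean 1, median log(2)
    results[n] = (np.mean(x), np.median(x))
    print(n, results[n])

print("population median:", np.log(2))
```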

This isn't a course in mathematics, and proving consistency is often quite difficult. It is important, however, to know what it means.

### Proposition
> Suppose the model conditions for the linear regression model hold true. Then the least squares estimators $\hat{\beta}_i$ of the regression coefficients $\beta_i$ are unbiased, have variance converging to $0$, and are consistent.

That an estimator $\hat{\theta}$ is unbiased and has variance converging to $0$ actually implies that $\hat{\theta}$ is consistent.

### Proposition
> Suppose that $\hat{\theta}_n$ is unbiased for $\theta$, i.e., $E(\hat{\theta}_n) = \theta$. Moreover, suppose that the variance of $\hat{\theta}_n$ converges to $0$ as $n\to\infty$. Then $\hat{\theta}_n \to \theta$ in probability. In other words, $\hat{\theta}_n$ is consistent for $\theta$.

#### Proof
Let $\sigma_{n}^{2}=\textrm{Var}(\hat{\theta}_{n}).$ Since $E(\hat{\theta}_n) = \theta$, [Chebyshev's inequality](https://en.wikipedia.org/wiki/Chebyshev%27s_inequality) gives
$$
P(|\hat{\theta}_{n}-\theta|\geq\epsilon)\leq\frac{\sigma_{n}^{2}}{\epsilon^{2}}.
$$
Let $\epsilon > 0$ be fixed. Since $\sigma_{n}^{2}\to0$ by assumption,
$\frac{\sigma_{n}^{2}}{\epsilon^{2}}\to0$ as well. Then, since $0 \leq P(|\hat{\theta}_{n}-\theta|\geq\epsilon) \leq\frac{\sigma_{n}^{2}}{\epsilon^{2}}$,
we find that $P(|\hat{\theta}_{n}-\theta|\geq\epsilon)\to0$ too.
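
To see the inequality at work, we can compare the Chebyshev bound with a simulated probability. The sketch below uses the sample mean of standard normal data, for which $\sigma_n^2 = 1/n$. The choices `epsilon = 0.5` and the sample sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=313)
epsilon, reps = 0.5, 5000  # hypothetical settings

results = {}
for n in (10, 50, 250):
    # Each row is one simulated data set of size n with mean 0 and variance 1
    samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))
    means = samples.mean(axis=1)
    empirical = np.mean(np.abs(means) >= epsilon)  # simulated P(|mean - 0| >= epsilon)
    bound = (1.0 / n) / epsilon**2                 # sigma_n^2 / epsilon^2 with sigma_n^2 = 1/n
    results[n] = (empirical, bound)
    print(n, empirical, bound)
```

The bound is usually far from tight, but it is enough for the proof: it goes to $0$, and it squeezes the probability down with it.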

### Corollary: Law of large numbers
Assume $X_1, X_2, \ldots$ is a sequence of independent and identically distributed variables with common mean $\mu$ and finite variance $\sigma^2$. Let $\overline{X}_n$ denote the sample mean, $\overline{X}_n = n^{-1}\sum_{i=1}^n{X_i}$. Then $\overline{X}_n\to\mu$ in probability.

#### Proof
Exercise.
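
While the proof is left as an exercise, the statement itself is easy to watch in action. The following sketch tracks the running mean of simulated rolls of a fair die, whose expectation is $3.5$. The die example and the number of rolls are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=313)
n = 100_000
rolls = rng.integers(1, 7, size=n)  # fair die: values 1, ..., 6 with mean 3.5

# running_mean[k - 1] is the mean of the first k rolls
running_mean = np.cumsum(rolls) / np.arange(1, n + 1)

for k in (10, 1000, 100_000):
    print(k, running_mean[k - 1])
```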


## Adjusted $R^2$


### Too big $R^2$ values
The $R^2$ is good for evaluating how well we can predict the outcome given our covariates, but it's not good for choosing between models. In a nutshell, it doesn't correct for the bias that occurs when the same data are used both to estimate the model parameters and to evaluate model fit.
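
One symptom of this problem is that $R^2$ never decreases when we add covariates, even covariates that have nothing to do with the outcome. The sketch below illustrates this with a hypothetical data-generating process, $y = 1 + 2x + \epsilon$, and pure-noise covariates. The helper `r_squared` is written for this example only and fits the model by least squares with `np.linalg.lstsq`.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_with = np.sum((y - design @ coef) ** 2)   # sum of squares with predictors
    ss_without = np.sum((y - np.mean(y)) ** 2)   # sum of squares without predictors
    return 1 - ss_with / ss_without

rng = np.random.default_rng(seed=313)
n = 50
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)  # hypothetical true model
noise = rng.normal(size=(n, 20))    # covariates unrelated to y

# Fit with x plus the first k noise covariates
r2_values = [r_squared(np.column_stack([x, noise[:, :k]]), y) for k in (0, 5, 10, 20)]
print(r2_values)
```

If we chose the model with the largest $R^2$, we would always end up with the biggest model, noise covariates included.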


### Constructing the adjusted $R^2$
$$R^2 = 1 - \frac{\textrm{Sum of squares with predictors}}{\textrm{Sum of squares without predictors}}$$

We can show that $\textrm{Sum of squares with predictors}$ is biased for its population value, the true sum of squares with predictors at our estimated regression coefficient. But we can correct for this bias! One can show that $$\frac{n}{n-p-1}E(\textrm{Sum of squares with predictors})$$ equals the true population sum of squares with predictors. Here $p$ is the number of estimated regression coefficients, not counting the intercept. Moreover, we can show that 
$$\frac{n}{n-1} E(\textrm{Sum of squares without predictors})$$ 
equals the true, population value of the sum of squares without predictors.

It follows that a reasonable corrected $R^2$ is
$$R_a^2 = 1 - \frac{\frac{n}{n-p-1} \textrm{Sum of squares with predictors}}{\frac{n}{n-1}\textrm{Sum of squares without predictors}}$$

Rearranging this, we find that
$$ R_a ^ 2 = 1 - (1 - R^2) \frac{n-1}{n-p-1} .$$
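
The rearranged formula is a one-liner in code. The sketch below uses a hypothetical helper, `adjusted_r_squared`, and made-up numbers. It also checks that the rearranged formula agrees with the ratio of corrected sums of squares it was derived from.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 from the ordinary R^2, sample size n, and p covariates."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Made-up numbers for illustration
n, p = 21, 2
ss_with, ss_without = 8.0, 16.0
r2 = 1 - ss_with / ss_without  # 0.5

from_formula = adjusted_r_squared(r2, n, p)
from_ratio = 1 - (n / (n - p - 1) * ss_with) / (n / (n - 1) * ss_without)
print(from_formula, from_ratio)

# A small R^2 combined with several covariates gives a negative value
print(adjusted_r_squared(0.01, n=21, p=5))
```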

**Note:** (i) The adjusted $R^2$, or $R^2_a$, can be less than $0$. Try to understand why. (ii) We haven't proved that the adjusted $R^2$ is unbiased. Do you think it is? Can you devise a simulation study to explore this problem?
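
One possible starting point for such a simulation study is sketched below. It covers a single scenario only, chosen as an assumption for illustration: the outcome is independent of all covariates, so the population $R^2$ is $0$. The values `n = 30`, `p = 5`, and the number of repetitions are arbitrary. What happens in this scenario says nothing about scenarios where the population $R^2$ is larger than $0$; exploring those is left to you.

```python
import numpy as np

rng = np.random.default_rng(seed=313)
n, p, reps = 30, 5, 2000  # hypothetical settings

r2 = np.empty(reps)
for i in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)  # independent of X, so the population R^2 is 0
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_with = np.sum((y - design @ coef) ** 2)
    ss_without = np.sum((y - y.mean()) ** 2)
    r2[i] = 1 - ss_with / ss_without

r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

mean_r2 = r2.mean()          # average R^2 over the simulated data sets
mean_r2_adj = r2_adj.mean()  # average adjusted R^2 over the same data sets
print(mean_r2, mean_r2_adj)
```

Compare both averages with the population value $0$, then change the data-generating process so that $y$ actually depends on the covariates and repeat the comparison.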

## Summary
1. The expectation of a random variable $X$ is denoted by $E(X)$, and equals the theoretical mean of $X$.
2. An estimator approximates a population value based on observed data.
3. An estimator is *consistent* if it approximates the population value arbitrarily well as $n\to \infty$.
4. An estimator $\hat{\theta}_n$ is *unbiased* if it equals $\theta$ in expectation, i.e., $E[\hat{\theta}_n]=\theta$.
5. Unbiased estimation is not important in most scenarios, but it makes sense to correct the $R^2$ for bias.
6. One attempt at bias-corrected $R^2$ is the adjusted $R^2$.