# EBA3500 Exercises 11: Expectation, consistency, and the adjusted $R^2$

## Difficulty classifications
* 🐇: Should be very easy for everyone.
* 🐖: Should be very easy for some, but harder for others.
* 🦢: Should demand some work to finish.
* 🐅: A challenge exercise that isn't strictly part of the curriculum. 

## Exercise 1: Unbiased estimation

### (a) (🐖)
Show that the sample mean, $\frac{1}{n}\sum_i x_i$, is unbiased for $E(X)$. (**Hint:** Linearity of expectation.)
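
A simulation cannot replace the proof, but it gives a quick sanity check of the claim. The sketch below assumes normal data; the values of `mu`, `n`, and `n_reps` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Arbitrary illustration values, not taken from the exercise.
mu, n, n_reps = 2.0, 10, 100_000

# Each row is one sample of size n; take the mean of every sample.
samples = rng.normal(mu, 1.0, size=(n_reps, n))
sample_means = samples.mean(axis=1)

# If the sample mean is unbiased, this average should be close to mu.
mean_of_means = sample_means.mean()
print(mean_of_means)
```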

### (b) (🦢)
The median is sometimes biased, sometimes not. Using the `np.median` function, write a simulation checking whether the sample median is an unbiased estimator of the population median for (i) the normal distribution with mean $\mu$ and standard deviation $1$, and (ii) the exponential distribution with rate parameter $\lambda$. You may use that:

1. The median of a normal distribution equals its mean.
2. The median of an exponential distribution with parameter $\lambda$ equals $\frac{1}{\lambda} \log 2$.
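
One possible way to organize such a simulation is sketched below for the normal case only. The helper name `median_bias` and the values of `mu`, `n`, and `n_reps` are assumptions made for illustration. For the exponential case, note that NumPy's `rng.exponential` is parameterized by the scale $1/\lambda$, not by the rate $\lambda$.

```python
import numpy as np

def median_bias(sampler, true_median, n, n_reps):
    """Estimate E(sample median) - true median by simulation.

    sampler(size) must return an array of draws with the given shape.
    """
    samples = sampler((n_reps, n))
    medians = np.median(samples, axis=1)  # one sample median per row
    return medians.mean() - true_median

rng = np.random.default_rng(seed=1)
mu = 2.0  # arbitrary illustration value

# Normal case: the population median equals the mean mu.
bias_normal = median_bias(
    lambda size: rng.normal(mu, 1.0, size), mu, n=11, n_reps=50_000
)
print(bias_normal)
```

Passing the sampler as a function keeps the helper reusable: only the sampler and the true median need to change for another distribution.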

### (c) (🐅) (For those who know how to integrate!)
Let $U$ be uniform on $[0, 1]$, which implies that the probability density function of $U$ is $p(u) = 1$ on that interval. Calculate the expectation of $-\log U$. (**Hint:** Either (i) use integration by parts on $-\int 1\cdot\log u \, du$, or (ii) use the substitution $x = \log u$ and a more obvious integration by parts.)
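
Whichever route you take for the integral, a Monte Carlo estimate offers a numerical check of your answer. This is only a sketch; the number of draws is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# rng.random draws from [0, 1); using 1 - draw gives values in (0, 1],
# which avoids taking the logarithm of zero.
u = 1.0 - rng.random(1_000_000)

# Monte Carlo estimate of E(-log U).
estimate = np.mean(-np.log(u))
print(estimate)
```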

### (d) (🐅) 
The expected sample median of the exponential distribution has a closed form. Use for instance [this](https://math.stackexchange.com/questions/80475/order-statistics-of-i-i-d-exponentially-distributed-sample) answer to deduce it, and compare it to the result in exercise (b). (**Note:** The exponential distribution is unusually well-behaved. Don't expect to find closed form solutions for the expected sample medians for general non-symmetric distributions.)

## Exercise 2: The unbiased sample variance estimator

We will show that $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X})^2$ is an unbiased estimator of $\textrm{Var}X = E(X^2) - (EX)^2$ provided the *covariance* between distinct observations is zero, i.e., $$\textrm{Cov}(X_i, X_j) = E(X_i X_j) - E X_i E X_j = 0 \quad \textrm{for } i \neq j.$$
We will assume the variables are independent and identically distributed. Independence implies that the covariances are zero, and identical distribution implies that $E(X_i) = E(X)$ and $E(X_i^2) = E(X^2)$ for all $i$.

**To make calculations easier, we will assume that $E(X_i) = 0$.**

### (a) (🐇)
Show that 
$$\left(X_{i}-\frac{1}{n}\sum_{j=1}^{n}X_{j}\right)^{2}=X_{i}^{2}-\frac{2}{n}X_{i}\left(\sum_{j=1}^{n}X_{j}\right)+\frac{1}{n^{2}}\left(\sum_{j=1}^{n}X_{j}\right)^{2}.$$


### (b) (🐇)
Deduce that
$$(n-1)S^2 = \sum_{i=1}^{n}X_{i}^{2}-\frac{2}{n}\left(\sum_{i=1}^{n}X_{i}\right)^{2}+\frac{1}{n}\left(\sum_{i=1}^{n}X_{i}\right)^{2} =\sum_{i=1}^{n}X_{i}^{2}-\frac{1}{n}\left(\sum_{i=1}^{n}X_{i}\right)^{2}.$$
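
The identity is purely algebraic, so it can be checked numerically on any data vector. The sketch below uses an arbitrary simulated sample and NumPy's `np.var` with `ddof=1`, which computes $S^2$.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.normal(0.0, 1.0, size=25)  # arbitrary illustration sample
n = x.size

# Left-hand side: (n - 1) * S^2, with S^2 computed by NumPy.
lhs = (n - 1) * np.var(x, ddof=1)

# Right-hand side: sum of squares minus (1/n) times the squared sum.
rhs = np.sum(x**2) - np.sum(x) ** 2 / n

print(np.isclose(lhs, rhs))
```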


### (c) (🐖)
Show that $(\sum_{i=1}^{n}X_{i})^{2}=\sum_{i=1}^{n}\sum_{j=1}^{n}X_{i}X_{j}.$

Deduce that $$(\sum_{i=1}^{n}X_{i})^{2} =\sum_{i=1}^{n}X_{i}^{2}+\sum_{i=1}^{n}\sum_{j\neq i}X_{i}X_{j}.$$ 
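
To see the double sum concretely, the products $X_i X_j$ can be arranged in an $n \times n$ matrix: the diagonal holds the squares and everything else is a cross term. A small numerical sketch, with an arbitrary vector `x`:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.normal(size=6)  # arbitrary illustration vector

# products[i, j] = x[i] * x[j], so summing all entries is the double sum.
products = np.outer(x, x)
square_of_sum = x.sum() ** 2

# Split the double sum into diagonal (i == j) and off-diagonal (i != j) parts.
diagonal = np.trace(products)
off_diagonal = products.sum() - diagonal

print(np.isclose(square_of_sum, products.sum()))
print(np.isclose(diagonal, np.sum(x**2)))
```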

### (d) (🐖)
Show that $E[\frac{1}{n}\sum_{i=1}^n X_{i}^{\alpha}]=E[X_{i}^{\alpha}]$ for any $\alpha$. Use this to simplify $E[\sum_{i=1}^{n}X_{i}^{2}]$. Moreover, explain why $E[\sum_{i=1}^{n}\sum_{j\neq i}X_{i}X_{j}] = 0$.

### (e) (🐖)
Finish the argument that $ES^2 = \textrm{Var}(X)$.
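
The result can also be checked by simulation. The sketch below assumes normal data with mean $0$ and compares the $\frac{1}{n-1}$ estimator with the $\frac{1}{n}$ version; `sigma`, `n`, and `n_reps` are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Arbitrary illustration values; the true variance is sigma**2 = 4.
sigma, n, n_reps = 2.0, 5, 200_000
samples = rng.normal(0.0, sigma, size=(n_reps, n))

# ddof=1 divides by n - 1 (this is S^2); ddof=0 divides by n.
mean_s2 = np.var(samples, axis=1, ddof=1).mean()
mean_biased = np.var(samples, axis=1, ddof=0).mean()

print(mean_s2, mean_biased)
```

A small `n` is used on purpose, since the difference between the two estimators shrinks as $n$ grows.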

## Exercise 3: Adjusted $R^2$

### (a) (🐇) Plotting a function
For $n = 10, 50, 100, 1000, 10000$, plot the function $E(R^2) = (k-1)/(n-1)$ for $k = 1, \ldots, n$. Interpret the plots.
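
A possible starting point is to compute the values first and plot them afterwards. The helper name `expected_rsq` is an assumption made for illustration; the returned arrays can be passed to, for example, matplotlib's `plt.plot(k, values)`.

```python
import numpy as np

def expected_rsq(n):
    """Return k = 1, ..., n and the values (k - 1) / (n - 1)."""
    k = np.arange(1, n + 1)
    return k, (k - 1) / (n - 1)

for n in [10, 50, 100, 1000, 10000]:
    k, values = expected_rsq(n)
    # Print the endpoints as a quick check before plotting.
    print(n, values[0], values[-1])
```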

### (b) (🐖) Variance of the $R^2$
Suppose the true population $R^2$ is equal to $0$. Calculate the variance of the sample $R^2$. Plot the variance function, just as you did for $E(R^2)$ in the previous exercise. (**Hint:** What is the distribution of $R^2$ under this assumption?)
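
If your answer to the hint turns out to be a Beta distribution with parameters $a$ and $b$ (the symbols $a$, $b$, and $Y$ are introduced here for illustration only), the standard moment formulas are

$$E(Y) = \frac{a}{a+b}, \qquad \textrm{Var}(Y) = \frac{ab}{(a+b)^2(a+b+1)}, \qquad Y \sim \textrm{Beta}(a, b).$$

Matching the mean formula against $E(R^2) = (k-1)/(n-1)$ from (a) is one way to sanity-check your choice of $a$ and $b$.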

### (c) (🐖) Simulation function, take 1
Consider the regression setup in the lecture, where all the $\beta_i$ coefficients are $0$:

In [1]:
import numpy as np
rng = np.random.default_rng(seed = 313)
p = 10
n = 100
x = rng.normal(0, 1, (n, p))
y = rng.normal(3, 2, n)

Make a function that simulates the distribution of the $R^2$ and the adjusted $R^2$ when $x$ is kept fixed across simulations and $y$ is independent of $x$.

In [None]:
def rsqs(n, p, n_reps):
    """ document! """
    x = rng.normal(0, 1, (n, p))  # drawn once, kept fixed across simulations
    # simulate rsqs and adjusted rsqs
    return {"adjusted_rsq": ..., "rsq": ...}  # replace ... with the simulated arrays
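
As a building block for the simulation, the sketch below computes the $R^2$ and adjusted $R^2$ for a single data set. It is not the full solution. The helper name `rsq_pair` is hypothetical, and the adjusted $R^2$ is assumed to follow the common definition $1 - (1 - R^2)\frac{n-1}{n-p-1}$ for a model with an intercept and $p$ covariates; check this against the definition used in the lecture.

```python
import numpy as np

def rsq_pair(x, y):
    """Return (R^2, adjusted R^2) for an OLS fit of y on x with an intercept."""
    n, p = x.shape
    design = np.column_stack([np.ones(n), x])  # add the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    rss = np.sum(residuals**2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
    rsq = 1.0 - rss / tss
    # Assumed definition of the adjusted R^2 (see the note above).
    adjusted_rsq = 1.0 - (1.0 - rsq) * (n - 1) / (n - p - 1)
    return rsq, adjusted_rsq

rng = np.random.default_rng(seed=313)
x = rng.normal(0, 1, (100, 10))
y = rng.normal(3, 2, 100)
rsq, adjusted_rsq = rsq_pair(x, y)
print(rsq, adjusted_rsq)
```

Inside `rsqs`, a helper like this could be called `n_reps` times with a fresh `y` and the same `x`.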

### (d) (🐖) Comparing the $R^2$ values
Make a function that makes histograms of the simulated $R^2$ values from the previous function. 

### (e) (🐖) Simulate.
For a reasonable selection of $p$ and $n$, simulate the $R^2$ and adjusted $R^2$ values and plot them in a nice way. Does the adjusted $R^2$ appear to be unbiased?

### (f) (🦢) Simulation function, take 2
Modify the function in (c) to take a `beta` vector of regression coefficients. Then do the same as in (d) and (e).
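
The only new ingredient is how $y$ is generated. A minimal sketch, assuming a linear model with normal noise; the particular `beta`, `intercept`, and `sigma` values are hypothetical and chosen only to mirror the setup in (c):

```python
import numpy as np

rng = np.random.default_rng(seed=313)
n, p = 100, 10
x = rng.normal(0, 1, (n, p))

# Hypothetical coefficients; setting beta to all zeros recovers the setup in (c).
beta = np.linspace(0.0, 1.0, p)
intercept, sigma = 3.0, 2.0

# y now depends on x through the linear predictor x @ beta.
y = intercept + x @ beta + rng.normal(0, sigma, n)
print(y.shape)
```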
