# CS 639 - Foundations of Data Science

# Discussion 03 (Estimating variance and Set Balancing Problem)

In this discussion session, we are going to cover more examples of using the Hoeffding bound, including a deeper dive into a practical application.

* Bounding the population variance from n i.i.d. samples X_i: Define Y_i = (X_i - E(X))^2 then apply Hoeffding bound.
* The set balancing problem: How to apply the mathematical tools we learnt to a real-life problem to get useful intuitions. (Section 4.4 from M. Mitzenmacher, E. Upfal: Probability and Computing)

---------

## Estimating Variance

Given $n$ samples $X_1, \cdots, X_n$, where the $X_i$'s are drawn independently from the distribution of a bounded random variable $X$, we would like to say something about $X$ based on the samples we observe. We have seen in class that we can give a confidence bound for the expected value of $X$ by using $S_n / n$ (where $S_n = \sum_{i=1}^n X_i$) as an estimator for the mean and then applying the Hoeffding bound. One less obvious thing we can also do is give a similar estimate for the variance of $X$, provided we know the true mean $\mathbb{E}[X]$.

The trick is to observe that the variance of $X$ is itself an expectation. Recall that the variance of $X$ is defined as $\mathrm{Var} [X] = \mathbb{E} [(X - \mathbb{E}[X])^2]$. If we define a new random variable $Y = (X - \mathbb{E}[X])^2$, then $\mathrm{Var} [X] = \mathbb{E} [Y]$. We can therefore apply Hoeffding's inequality to give a confidence interval for the true variance of $X$ based on the empirical variance computed from $X_1, \cdots, X_n$.

In particular, we will generate $Y_1,\cdots, Y_n$ where $Y_i = (X_i - \mathbb{E}[X])^2$ for $i \in [n]$.
* (Boundedness) First, let us argue that $Y_i$'s are bounded: Suppose $X \in [a, b]$ for some $a, b \in \mathbb{R}$, then we know $Y \in [a', b']$ where $a' = 0$ and $b' = (b-a)^2$ (why?).
* (Independence) Next, let us argue that the $Y_i$'s are independent as well. Note that $\mathbb{E}[X]$ is assumed to be a known constant, so each $Y_i$ is a fixed function of $X_i$ alone. Since the $X_i$'s are independent, we can conclude that the $Y_i$'s are independent.

Therefore, we can use the Hoeffding bound to bound the deviation of $\frac{1}{n} \sum_{i=1}^n Y_i$ from the variance of $X$, which equals $\mathbb{E} [Y]$. Notice that $\frac{1}{n} \sum_{i=1}^n Y_i$ is exactly our empirical variance (computed with the known mean), and we will denote it as $\hat{\sigma}^2_n (X)$.
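As a quick illustration, here is a minimal sketch of this estimator in Python. It assumes, purely for the sake of example, that $X$ is uniform on $[0, 1]$, so that $\mathbb{E}[X] = 1/2$ and $\mathrm{Var}[X] = 1/12$. The function name `empirical_variance_known_mean` is our own and not from the lecture.

```python
import random

def empirical_variance_known_mean(samples, true_mean):
    """Average of Y_i = (X_i - E[X])^2, using the known true mean."""
    return sum((x - true_mean) ** 2 for x in samples) / len(samples)

# Illustrative assumption: X ~ Uniform[0, 1], so E[X] = 0.5 and Var[X] = 1/12.
rng = random.Random(0)
samples = [rng.random() for _ in range(10_000)]
estimate = empirical_variance_known_mean(samples, true_mean=0.5)
print(f"empirical variance: {estimate:.4f}, true variance: {1 / 12:.4f}")
```

Note that the estimator divides by $n$ and plugs in the true mean, which is what makes the $Y_i$'s independent.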

Similar to what we did in lecture with the Hoeffding bound, for any $t > 0$ we get
$$
  \mathbb{P} \left[\left|\frac{1}{n} \sum_{i=1}^n Y_i - \mathbb{E}[Y] \right| \ge t \right] \le 2 e^{\frac{-2nt^2}{(b' - a')^2}}.
$$

Using the values for $a'$ and $b'$ and rewriting the expression using $X$, we get
$$
  \mathbb{P} \left[ \left| \hat{\sigma}_n^2 (X) - \mathrm{Var}[X] \right| \ge t \right] \le 2 e^{\frac{-2nt^2}{(b - a)^4}}.
$$

In plain words, the probability that our empirical variance is far from the population variance decays exponentially as the number of samples $n$ grows.
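To get a feel for the numbers, we can invert the bound: requiring $2 e^{-2nt^2/(b-a)^4} \le \delta$ and solving for $n$ gives $n \ge \frac{(b-a)^4 \ln(2/\delta)}{2t^2}$. The sketch below evaluates both directions. The function names and the example values ($X \in [0, 1]$, $t = 0.05$, $\delta = 0.05$) are our own illustrative choices.

```python
import math

def variance_tail_bound(n, t, a, b):
    """Upper bound 2 * exp(-2 n t^2 / (b - a)^4) on P[|sigma_hat^2 - Var[X]| >= t]."""
    return 2.0 * math.exp(-2.0 * n * t ** 2 / (b - a) ** 4)

def samples_needed(t, delta, a, b):
    """Smallest n for which the bound above is at most delta."""
    return math.ceil((b - a) ** 4 * math.log(2.0 / delta) / (2.0 * t ** 2))

# Illustrative values (assumptions): X in [0, 1], tolerance t = 0.05,
# failure probability delta = 0.05.
n = samples_needed(t=0.05, delta=0.05, a=0.0, b=1.0)
print(n, variance_tail_bound(n, t=0.05, a=0.0, b=1.0))
```

Notice that the required sample size scales with $(b-a)^4$, so widening the range of $X$ is much more expensive here than it is when estimating the mean.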


---------

## Set Balancing Problem

In many statistical trials, we would like to divide the candidates into two groups: the control group and the experimental group. Ideally, we would like the two groups to be relatively similar in terms of their identifiable features. For example, if we want to conduct medical trials to determine the efficacy of a COVID vaccine, we would want to make sure that the age, race, gender and other factors of the two groups are similar.

This problem can be formulated mathematically as a Set Balancing Problem. Using the tools we have learnt in class so far, we can actually say something about how well (or badly) we can do without knowing anything about the features of any candidate. Please refer to Section 4.4 from M. Mitzenmacher, E. Upfal: Probability and Computing.
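As a rough sketch of the idea, recall the textbook's formulation as we understand it (please check Section 4.4 for the exact statement): the features are encoded in an $n \times m$ matrix $A$ with 0/1 entries, where $A_{ij} = 1$ if subject $j$ has feature $i$, and an assignment is a vector $b \in \{-1, +1\}^m$. The imbalance is $\|Ab\|_\infty$, and choosing each $b_j$ uniformly at random gives $\|Ab\|_\infty \le \sqrt{4 m \ln n}$ with probability at least $1 - 2/n$. The code below simulates one such random assignment. The random matrix, the sizes and the function name are our own illustrative assumptions.

```python
import math
import random

def random_assignment_imbalance(A, rng):
    """Assign each subject (column) to +1 or -1 uniformly at random and
    return the largest absolute feature imbalance, max_i |sum_j A[i][j] * b[j]|."""
    m = len(A[0])
    b = [rng.choice((-1, 1)) for _ in range(m)]
    return max(abs(sum(a_ij * b_j for a_ij, b_j in zip(row, b))) for row in A)

# Illustrative assumption: 20 features, 1000 subjects, each feature present
# independently with probability 1/2.
rng = random.Random(0)
n_features, m_subjects = 20, 1000
A = [[rng.randint(0, 1) for _ in range(m_subjects)] for _ in range(n_features)]
imbalance = random_assignment_imbalance(A, rng)
threshold = math.sqrt(4 * m_subjects * math.log(n_features))
print(f"imbalance: {imbalance}, sqrt(4 m ln n): {threshold:.1f}")
```

The assignment never looks at $A$, which is the point: a coin flip per candidate already keeps every feature roughly balanced.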