<center><img src='https://drive.google.com/uc?export=view&id=12CrUdXDAiltLBT26sG7HZ_HciIhvGyT8'></center>

# Statistical machine learning - Notebook 2
**Author: Michał Ciach, Grzegorz Preibisch, Dorota Celińska-Kopczyńska**  


In today's class, we will learn some aspects of parameter and interval estimation.
In the first section, we will focus on the graphical analysis of the properties of the point estimator (e.g., mean value of a distribution) using artificial data.
In the second section, we will focus on estimating the mean value of a distribution using a statistical sample.   
In the last section, we'll calculate confidence intervals for the mean using real-world data.  


In [None]:
!pip install gdown

In [None]:
!gdown https://drive.google.com/uc?id=1xOJfD-jexDbHSOCg1EiyAxqc5kXjMvX0

## Data & library imports

In [None]:
import pandas as pd
import numpy as np
import numpy.random as rd
import plotly.express as px
from scipy.stats import norm
from scipy.stats import t as tstud
from typing import Callable, List
import plotly.graph_objects as go

In [None]:
protein_lengths = pd.read_csv('protein_lengths.tsv', sep='\t')
protein_lengths

## Point estimation -- simulational approach

During the lecture, we introduced the properties of the estimator. To develop more intuition about them, in this notebook we will graphically analyze their distributions (and the distributions of some relevant statistics).

Contrary to other notebooks, in this one we will start by analyzing artificial data. The advantage of (pseudo)randomly generated data is that we are able to control nearly every aspect of the study. E.g., if we want to analyze the properties of an estimator, we need to know the true value of the parameter it estimates (the ground truth, something we know we should obtain). In real-world data we rarely know the true values. Using (pseudo)randomly generated data allows us to set the true values as the parameters of the generation mechanism.
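
As a minimal sketch of this idea (the seed, the sample size, and the variable names below are illustrative assumptions, not part of the exercises), we can fix the true parameters ourselves, generate data from them, and measure the estimation error directly:

```python
import numpy as np

# Hypothetical ground truth, chosen by us
true_mu_demo = 200.0
true_sigma_demo = 12.0

# A seeded generator makes the artificial experiment reproducible
rng = np.random.default_rng(seed=42)
sample_demo = rng.normal(loc=true_mu_demo, scale=true_sigma_demo, size=10_000)

# Because the parameters are known, the estimation errors can be computed exactly
error_mean = sample_demo.mean() - true_mu_demo
error_std = sample_demo.std(ddof=1) - true_sigma_demo
print("Error of the sample mean:", error_mean)
print("Error of the sample standard deviation:", error_std)
```

With real-world data, neither of the two errors could be computed, because the true values would be unknown.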

**Exercise 1.** In this exercise we will analyze the distribution of a few estimators (for normally distributed data) from repeated experiments (the number of repetitions will be $N$). The size of the sample ($n$) will be the same in each experiment. To this end:

* Write a function that will return $N$ samples of size $n$ from a Gaussian distribution with a given mean $\mu$ and variance $\sigma^2$. Run the function with parameters: $N=1000$, $n = 30$, $\mu = 200$, $\sigma^2=144$
* Based on the dataset from the previous point, for each sample of size $n$ compute the following estimators:

 - mean $\hat{X} = \frac{1}{n} \sum_{i=1}^n X_i$
 - unbiased estimator for variance $\tilde{S}_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i-\hat{X})^2$
 - biased estimator for variance $S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i-\hat{X})^2$
 - estimator for variance with minimal MSE $S_n^{2*} = \frac{1}{n+1} \sum_{i=1}^n (X_i-\hat{X})^2$

Hint: For that purpose, we suggest writing one function that runs the experiment and passing the estimator to it as an argument. For variance estimators, make the fullest use of `np.var`.

* Create the histograms for the values of the computed estimators. Plot the values of the true parameters ($\mu = 200$, $\sigma^2=144$ ). What can you say about the distribution of the estimators? Are there differences in the shape of the distributions of the estimators? Discuss and justify your view.
* Compute the biases, variances, and MSE of the estimators. Do the results agree with the theoretical results?
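
The hint about `np.var` can be illustrated with a short sketch. It assumes one artificial sample (the seed and the names are illustrative) and shows how the `ddof` argument changes the divisor, and how the $n+1$ variant can be obtained by rescaling:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(200, 12, size=30)       # one illustrative sample
n = len(x)
ss = np.sum((x - x.mean()) ** 2)       # sum of squared deviations from the sample mean

var_biased_demo = np.var(x, ddof=0)    # divides by n
var_unbiased_demo = np.var(x, ddof=1)  # divides by n - 1
var_min_mse_demo = np.var(x, ddof=0) * n / (n + 1)  # rescaled so that the divisor is n + 1

print(var_biased_demo, ss / n)
print(var_unbiased_demo, ss / (n - 1))
print(var_min_mse_demo, ss / (n + 1))
```

Each printed pair should agree up to floating-point error.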

In [None]:
true_mu = 200
true_stddev = 12
true_var = 144
size = 30

In [None]:
def make_dataset(N : int = 1000, sample_size : int = 3, mu : float = 0, std_dev : float = 1) -> List[np.array]:
  """
  Function which samples N times from a Gaussian distribution with mean mu and standard deviation std_dev
  :param N: number of samples
  :param sample_size: number of observations in one sample
  :param mu: mean of the distribution
  :param std_dev: standard deviation of the distribution
  :return: list of N samples, each of size sample_size, from the normal distribution
  """
  samples = []
  for step in range(N):
    sample = np.random.normal(mu,std_dev,sample_size)
    samples.append(sample)
  return samples

In [None]:
data = make_dataset(mu=true_mu, std_dev=true_stddev, sample_size=size)

In [None]:
def run_experiment( samples: List[np.array], statistic : Callable[[np.array], float] = np.mean  ) -> List[float]:
  """
  Function which computes estimation for a given batch of data
  :param samples: results of function make_dataset
  :param statistic: function to estimate the parameters
  :return: list of estimated parameters
  """
  results = []
  for sample in samples:
    results.append(statistic(sample))
  return results

In [None]:
def var_unbiased(sample : np.array) -> float:
  return np.var(sample,  ddof=1)

def var_ML(sample : np.array) -> float:
  return np.var(sample,  ddof=0)

# The sample size is read from the sample itself instead of the global variable `size`
def var_minMSE(sample : np.array) -> float:
  n = len(sample)
  return np.var(sample,  ddof=0)*(n/(n+1))

In [None]:
biased = run_experiment(samples = data, statistic = np.var)
unbiased = run_experiment(samples = data, statistic = var_unbiased)
minMSE = run_experiment(samples = data, statistic = var_minMSE)

In [None]:
# code for variance
fig = go.Figure()
fig.add_trace(go.Histogram(x=biased,  name='biased' ))
fig.add_trace(go.Histogram(x=unbiased,  name='unbiased'))
fig.add_trace(go.Histogram(x=minMSE,  name='minMSE'))
# Overlay the histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see histograms
fig.update_traces(opacity=0.75)
fig.add_vline(x=true_var)
fig.show()

In [None]:
# code for mean
results_mean = run_experiment(samples = data, statistic = np.mean)

fig = go.Figure()
fig.add_trace(go.Histogram(x=results_mean,  name='mean' ))
# Reduce opacity to see histograms
fig.update_traces(opacity=0.75)
fig.add_vline(x=true_mu)
fig.show()

In [None]:
# code for bias
bias_mean = np.mean(np.array(results_mean)) - true_mu
print(bias_mean)
bias_unbiased = np.mean(np.array(unbiased)) - true_var
print(bias_unbiased)
bias_biased = np.mean(np.array(biased)) - true_var
print(bias_biased)
bias_minMSE = np.mean(np.array(minMSE)) - true_var
print(bias_minMSE)

In [None]:
# code for MSE
MSE_mean = np.mean((true_mu - np.array(results_mean))**2)
print(MSE_mean)
MSE_unbiased = np.mean((true_var - np.array(unbiased))**2)
print(MSE_unbiased)
MSE_biased = np.mean((true_var - np.array(biased))**2)
print(MSE_biased)
MSE_minMSE = np.mean((true_var - np.array(minMSE))**2)
print(MSE_minMSE)
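
To compare the empirical results above with theory, here is a hedged sketch of the theoretical values for normally distributed data. It relies on the standard fact that $\sum_{i=1}^n (X_i-\hat{X})^2/\sigma^2$ follows a $\chi^2$ distribution with $n-1$ degrees of freedom; the function name and the dictionary layout are illustrative assumptions.

```python
def theoretical_variance_estimator_properties(n: int, sigma2: float, divisor: int) -> dict:
    """Bias, variance and MSE of sum((X_i - mean)^2) / divisor for normal data."""
    expected = (n - 1) * sigma2 / divisor             # E[SS] = (n - 1) * sigma^2
    bias = expected - sigma2
    variance = 2 * (n - 1) * sigma2**2 / divisor**2   # Var[SS] = 2 * (n - 1) * sigma^4
    return {"bias": bias, "variance": variance, "mse": variance + bias**2}

n_demo, sigma2_demo = 30, 144.0
divisors = {"unbiased": n_demo - 1, "biased": n_demo, "minMSE": n_demo + 1}
theory = {
    name: theoretical_variance_estimator_properties(n_demo, sigma2_demo, d)
    for name, d in divisors.items()
}
# The sample mean is unbiased, so its MSE equals its variance sigma^2 / n
theory_mean_mse = sigma2_demo / n_demo

for name, props in theory.items():
    print(name, props)
print("mean:", {"bias": 0.0, "mse": theory_mean_mse})
```

The simulated biases and MSEs should be close to these numbers, with differences shrinking as the number of repetitions $N$ grows.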

**Exercise 2. (optional)**  In the previous exercise, we worked with the distribution of the estimators by running multiple experiments with a given sample size (by default quite small). In this exercise, we will analyze the asymptotic properties of the estimators. To this end:

* Compute 100 samples of each size from 2 to 5000 (you may keep using data from N(200,144)). For each sample compute:

 - $\hat{X} = \frac{1}{n} \sum_{i=1}^n X_i$ as a mean estimator
 - $\hat{X} +10$ as a mean estimator
 - unbiased estimator for variance $\tilde{S}_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i-\hat{X})^2$
 - biased estimator for variance $S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i-\hat{X})^2$

* Scatterplot the obtained values of the estimators against the sample size: the y axis should be the value of the estimator, and the x axis should be the size of the sample. Add a horizontal line with the true value of the parameter. Compare the plots. Do the values of the estimators become closer to the true values when the size of the sample increases? Do the results agree with the results on consistency of the estimators? Discuss.

* Compute the biases of the estimators. Plot the biases of the estimators against the sample size. Add a horizontal line at zero. Compare the insights from the plots with the theory of asymptotic unbiasedness.

Hint: We encourage you to use scatterplots, but visualizations of distributions may also come in handy.

In [None]:
def mean_biased(sample : np.array) -> float:
    return np.mean(sample) + 10

In [None]:
max_sample_size = 5000
# For faster computations; should be set to 1 for exact solution
step = 100

sample_sizes = list(range(2,max_sample_size+1,step))

mean_unb = []
mean_b = []
var_unb = []
var_b = []

for sample_size in sample_sizes:
    
    samples = make_dataset(mu=true_mu, std_dev=true_stddev, N=100, sample_size=sample_size)

    mean_unb.append([np.mean(sample) for sample in samples])
    mean_b.append([mean_biased(sample) for sample in samples])
    var_unb.append([var_unbiased(sample) for sample in samples])
    var_b.append([var_ML(sample) for sample in samples])

In [None]:
from itertools import repeat

fig = px.scatter(x=list(repeat(2,100)), y=mean_unb[0], title='Estimated unbiased mean')
for ind in range(1,len(sample_sizes)):
    fig.add_trace(go.Scatter(x=list(repeat(sample_sizes[ind],100)), y=mean_unb[ind], mode='markers',
                             marker=dict(color=fig.data[0].marker.color)))

fig.add_hline(y=true_mu)
fig.update_layout(showlegend=False)
fig.show()

In [None]:
fig = px.scatter(x=list(repeat(2,100)), y=mean_b[0], title='Estimated biased mean')
for ind in range(1,len(sample_sizes)):
    fig.add_trace(go.Scatter(x=list(repeat(sample_sizes[ind],100)), y=mean_b[ind], mode='markers',
                            marker=dict(color=fig.data[0].marker.color)))

fig.add_hline(y=true_mu)
fig.update_layout(showlegend=False)
fig.show()

In [None]:
fig = px.scatter(x=list(repeat(2,100)), y=var_unb[0], title='Estimated unbiased variance')
for ind in range(1,len(sample_sizes)):
    fig.add_trace(go.Scatter(x=list(repeat(sample_sizes[ind],100)), y=var_unb[ind], mode='markers',
                            marker=dict(color=fig.data[0].marker.color)))

fig.add_hline(y=true_var)
fig.update_layout(showlegend=False)
fig.show()

In [None]:
fig = px.scatter(x=list(repeat(2,100)), y=var_b[0], title='Estimated biased variance')
for ind in range(1,len(sample_sizes)):
    fig.add_trace(go.Scatter(x=list(repeat(sample_sizes[ind],100)), y=var_b[ind], mode='markers',
                            marker=dict(color=fig.data[0].marker.color)))

fig.add_hline(y=true_var)
fig.update_layout(showlegend=False)
fig.show()

In [None]:
bias_mean_unb = [np.mean(estimations) - true_mu for estimations in mean_unb]
fig = px.scatter(x=sample_sizes, y=bias_mean_unb, title='Bias of the unbiased estimator of the mean.')
fig.add_hline(y=0)
fig.show()

bias_mean_b = [np.mean(estimations) - true_mu for estimations in mean_b]
fig = px.scatter(x=sample_sizes, y=bias_mean_b, title='Bias of the biased estimator of the mean.')
fig.add_hline(y=0)
fig.show()

bias_var_unb = [np.mean(estimations) - true_var for estimations in var_unb]
fig = px.scatter(x=sample_sizes, y=bias_var_unb, title='Bias of the unbiased estimator of the variance.')
fig.add_hline(y=0)
fig.show()

# The difference can be better seen with sample sizes in the range [2,100].
bias_var_b = [np.mean(estimations) - true_var for estimations in var_b]
fig = px.scatter(x=sample_sizes, y=bias_var_b, title='Bias of the biased estimator of the variance.')
fig.add_hline(y=0)
fig.show()
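
The comment above notes that the bias of the biased variance estimator is easier to see for small samples. A minimal sketch of that check, assuming normal data and using the theoretical bias $-\sigma^2/n$ (the seed, the sample sizes, and the number of repetitions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2_demo = 144.0
small_sizes = [2, 5, 10, 50]
n_rep = 20_000  # many repetitions, so that the Monte Carlo noise is small

bias_table = {}
for n in small_sizes:
    # Each row is one sample of size n
    samples = rng.normal(200.0, np.sqrt(sigma2_demo), size=(n_rep, n))
    empirical_bias = np.var(samples, axis=1, ddof=0).mean() - sigma2_demo
    theoretical_bias = -sigma2_demo / n
    bias_table[n] = (empirical_bias, theoretical_bias)
    print(f"n = {n:3d}: empirical bias = {empirical_bias:8.2f}, theoretical bias = {theoretical_bias:8.2f}")
```

The bias is large for very small $n$ and vanishes as $n$ grows, which is what asymptotic unbiasedness means.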

**Exercise 3 -- optional homework.** Find an MLE estimator for $\lambda$ in exponential distribution (probability density function  $f(x) = \lambda \exp(-\lambda x)$ for $x \in [0, \infty)$), using pen and paper.
Prepare a similar analysis of the properties of the found estimator as the analyses in Exercise 1 and 2 (analysis of the distribution with a given sample size and the analysis of the asymptotical properties based on the increasing sample sizes). You may assume true $\lambda = 2$. Discuss your results.
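
A possible starting point for the simulation part, to be used only after attempting the pen-and-paper derivation. The sketch assumes the standard result that the MLE is $\hat{\lambda} = 1/\bar{X}$ and that, for $n > 1$, $\mathbb{E}(\hat{\lambda}) = n\lambda/(n-1)$; the seed and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
true_lambda = 2.0
n_exp, n_rep_exp = 30, 20_000

# NumPy parametrizes the exponential distribution by scale = 1 / lambda
samples_exp = rng.exponential(scale=1.0 / true_lambda, size=(n_rep_exp, n_exp))
lambda_hat = 1.0 / samples_exp.mean(axis=1)   # one estimate per sample (row)

empirical_bias_exp = lambda_hat.mean() - true_lambda
theoretical_bias_exp = true_lambda / (n_exp - 1)   # n*lambda/(n-1) - lambda
print("Empirical bias:", empirical_bias_exp)
print("Theoretical bias:", theoretical_bias_exp)
```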

## Exploratory analysis
The first step to any statistical analysis is to explore the data - check the basic statistics like the mean and variance, and visualize the data to see what kind of distribution we're dealing with.

**Exercise 4.** In this exercise, we'll extract the data about human proteins, perform a simple data transformation, and do a basic exploratory analysis.

1. Select the data about human protein lengths from the `protein_lengths` data frame, and put it into a data frame `human_protein_lengths`. Here, you may need to use the `.copy()` method for the subsequent steps to work (ask your tutor if you need a further explanation).
2. Calculate the base-10 logarithm of the protein length and append it to the `human_protein_lengths` data frame as a column called `LogLength`.
3. Use the `human_protein_lengths.describe()` method to check the basic statistics of the numerical columns of the data frame. What is the average length of a human protein? What is the maximum length?  
4. Use the `px.histogram()` functions to create histograms showing the distributions of the protein lengths and log-lengths. Which distribution is more spread around its average? Does any distribution resemble the Normal (Gaussian) distribution? Are there many proteins with lengths similar to the maximum length, or just a few?  
5. Calculate the average length and log-length and their standard deviations; Store them in variables `true_mean`, `true_mean_log`, `true_std` and `true_std_log`. We'll use them in subsequent exercises as our *ground truth* against which we'll evaluate our estimators.     

*Quick question*. Is $\text{true_mean}$ equal to $10^\text{true_mean_log}$? Why/why not? (note that we've used the base-10 logarithm)
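
A small numerical illustration of the quick question, using three hypothetical lengths rather than the protein data. Back-transforming the mean of the logarithms gives the geometric mean, which for positive numbers never exceeds the arithmetic mean:

```python
import numpy as np

lengths_demo = np.array([10.0, 100.0, 1000.0])   # hypothetical lengths
arithmetic_mean = lengths_demo.mean()
mean_log = np.log10(lengths_demo).mean()
back_transformed = 10 ** mean_log                # geometric mean of the lengths

print("Arithmetic mean:", arithmetic_mean)
print("10 ** (mean of log10):", back_transformed)
```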

*Why the base-10 logarithm?* Mathematicians usually prefer to use natural logarithms. In statistical data analysis, we sometimes also use base-10 logarithms, because their values are easier to interpret as orders of magnitude (or simply the numbers of digits) of our values. Although logarithms with different bases differ only by a constant factor, an easier interpretation is important to get more meaningful conclusions from the data.

*Why the standard deviation?* Some students may wonder why we prefer the standard deviation rather than the variance - after all, the difference is just a square root, so mathematically it's almost the same thing. Here, again, the reason is the interpretability of the results. When we compute the variance, we square the observations. As a consequence, their units also get squared. This means that, if we estimate the number of mushrooms in a forest, the variance is expressed in terms of *mushrooms squared*, which doesn't make any sense. Taking a square root brings the unit back to mushrooms.   

In [None]:
human_protein_lengths = protein_lengths.loc[protein_lengths['Common name'] == 'Human'].copy()
# Note: without .copy(), some versions of Pandas may return a View.
# This may interfere with adding a new column to human_protein_lengths.
print(human_protein_lengths.head())

human_protein_lengths['LogLength'] = np.log10(human_protein_lengths['Protein length'])

print(human_protein_lengths.describe())

fig = px.histogram(x=human_protein_lengths['Protein length'], title="Distribution of the human protein lengths")
fig.show()
fig = px.histogram(x=human_protein_lengths['LogLength'], title="Distribution of the human protein log-lengths")
fig.show()

true_mean = human_protein_lengths['Protein length'].mean()
true_mean_log = human_protein_lengths['LogLength'].mean()
true_std = human_protein_lengths['Protein length'].std()
true_std_log = human_protein_lengths['LogLength'].std()

## Point estimation

One of the main strengths of statistical theory is that it allows us to estimate many quantities (like the mean protein length, mean income in a country, or voting preferences) using only a sample of randomly selected observations, and, most importantly, to estimate the uncertainty of such estimation. This is the main reason why we derive properties of estimators, such as their expected value and variance. Good statisticians can derive estimators which need fewer observations and give better results.   

**Exercise 5.** In this exercise, we'll do an empirical analysis of the properties of the estimator of the mean. We'll use a sample of $N=1000$ randomly selected human proteins. Denote $X_i$ as the length of a randomly selected human protein, and $\log(X_i)$ as its base-10 logarithm. Define the following two estimators:

$$\hat{\mu}_X = \sum_{i=1}^N X_i/N, \text{an estimator of the mean length}$$

$$\hat{\mu}_{\log(X)} = \sum_{i=1}^N \log(X_i)/N, \text{an estimator of the mean log-length}$$  

First, we'll draw $R=2000$ independent samples and calculate the estimators. Here's an example way to do this:   

1. Create empty lists called e.g. `means` and `means_log`.  
2. Repeat the following $R$ times (e.g. using a `for` loop):  
    2.1. Get a random sample of size $N$ of the observations (i.e. rows) from `human_protein_lengths`; you can use the `.sample()` method.   
    2.2. Calculate the mean length and append to `means`  
    2.3. Calculate the mean log-length and append to `means_log`.     
3. Convert both lists to `numpy` arrays (e.g. `means = np.array(means)`)

Now, we can inspect how well the estimators approximate the *true* mean length $\mu_X$ and the *true* mean log-length $\mu_{\log{X}}$ (notice the lack of hats above $\mu$'s - this means that these are the true parameters, not estimators).

4. Estimate the mean value of the estimator of the mean length (by running `np.mean(means)`). Is it close to the true value $\mu_X$? In other words, does the estimator seem *unbiased*?
5. Estimate the bias of the estimator of the mean log-length (using the values in `means_log`). Does it seem biased? Does the result agree with the theoretical one about the estimator of the mean?       
6. Estimate the Root Mean Square Error of the estimator of the mean, defined as $\text{RMSE}(\hat{\mu}_X) = \sqrt{\mathbb{E}(\hat{\mu}_X - \mu_X)^2}$. This will tell you, approximately, the average error of $\hat{\mu}_X$ in terms of the number of amino acids (the building blocks of proteins). Why did I write *approximately*? (Hint: it's not just because we estimate it rather than calculate it theoretically)
7. Estimate $\text{RMSE}(\hat{\mu}_{\log(X)})$. How can you interpret the result?
8. Estimate the standard deviations of the estimators. Which one is less variable? Does it mean that one quantity is easier to estimate than the other?
9. Is $\text{sd}(\hat{\mu}_X) = \text{RMSE}(\hat{\mu}_X)$? Why/why not? Is the equation always true, sometimes true, or never true?    
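
The relation between the standard deviation, the bias, and the RMSE asked about in the last point can be checked numerically. The sketch below uses hypothetical estimates from a deliberately biased estimator (not the protein data) and the decomposition $\text{MSE} = \text{variance} + \text{bias}^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
truth_demo = 5.0
# Hypothetical estimates: centered at truth + 0.5, so the estimator is biased
estimates_demo = rng.normal(truth_demo + 0.5, 2.0, size=2000)

bias_demo = estimates_demo.mean() - truth_demo
sd_demo = estimates_demo.std()   # ddof=0, so the identity below holds exactly
rmse_demo = np.sqrt(np.mean((estimates_demo - truth_demo) ** 2))

print("RMSE^2:        ", rmse_demo ** 2)
print("sd^2 + bias^2: ", sd_demo ** 2 + bias_demo ** 2)
```

The two printed values coincide, so the RMSE equals the standard deviation only when the bias is zero.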


In [None]:
## Part 1
N = 1000  # Sample size
R = 2000  # Number of replications (as in the exercise description)
means = []
means_log = []
for i in range(R):
  sample = human_protein_lengths.sample(N)
  means.append(sample['Protein length'].mean())
  means_log.append(sample['LogLength'].mean())
means = np.array(means)
means_log = np.array(means_log)

In [None]:
## Part 2
print()
print('Symmetry of the distributions of the estimators:')
print('Fraction of estimates over the true mean of lengths:', np.mean(means > true_mean))
print('Fraction of estimates over the true mean of log-lengths:', np.mean(means_log > true_mean_log))
print()
print('Means of the estimators:')
print('Mean for mean length:', np.mean(means))
print('Mean for mean log-length:', np.mean(means_log))
print()
print('Bias of the estimators:')
print('Bias for mean length:', np.mean(means - true_mean))
print('Bias for mean log-length:', np.mean(means_log - true_mean_log))
print()
print('RMSE of the estimators:')
print('RMSE for mean length:', np.sqrt(np.mean((means - true_mean)**2)))
print('RMSE for mean log-length:', np.sqrt(np.mean((means_log - true_mean_log)**2)))
print()
print('Variability of the estimators:')
print('SD for mean length:', np.std(means))
print('SD for mean log-length:', np.std(means_log))

**Exercise 6.** The standard deviation of an unbiased estimator tells us how much it fluctuates around the true value. An estimator with a lower standard deviation will more often give us values that are close to the true one. This way, we can compare two estimators of the same thing.  

However, the standard deviation is often not that useful when comparing the measurements of two different things. This is because it depends on the units of the measurement. Suppose we measure the length $L$ of some objects in meters, and the standard deviation of the measurement is $\text{sd}(L)$. Measuring the same object in centimeters will give us a measurement equal to $100L$, and the corresponding standard deviation $\text{sd}(100L) = 100\text{sd}(L)$ will appear to be much larger, but it doesn't mean that measurements in centimeters are more difficult. To make matters worse, in real-life applications, the variability of the measurement often depends on its average value, regardless of the units. The standard deviation of the height of a mouse (a few millimeters) is much lower than that of an elephant (several centimeters), but it doesn't mean that mice are easier to measure. Similarly, in the case of protein length and log-length, the latter is much smaller, so it can be expected that its standard deviation will be smaller as well.

To evaluate the variability of an estimator regardless of its units and the average value, we can calculate a so-called [*coefficient of variation*](https://en.wikipedia.org/wiki/Coefficient_of_variation) (variation, not variance!). For a random variable $Y$, this is defined as $\text{cv}(Y) = \text{sd}(Y)/\mathbb{E}(Y)$.   

7. Calculate the coefficients of variation for the estimators of mean protein length, i.e. $\text{cv}(\hat{\mu}_X)$, and log-length, i.e. $\text{cv}(\hat{\mu}_{\log(X)})$. Which estimator is better in this case? In general, does a lower coefficient of variation always mean that an estimator is better?
8. Is $\text{cv}(Y)$ always equal to  $\text{sd}(Y/\mathbb{E}(Y))$? Is there a condition for $Y$ that makes it equal? Give an analytical argument and verify that empirically on the protein length data.
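
A minimal sketch of the unit-invariance argument, using hypothetical measurements in meters and centimeters (the helper name `coefficient_of_variation` is an illustrative assumption):

```python
import numpy as np

def coefficient_of_variation(values) -> float:
    """Standard deviation divided by the mean (meaningful for a positive mean)."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

rng = np.random.default_rng(4)
lengths_m = rng.normal(2.0, 0.1, size=1000)   # hypothetical measurements in meters
lengths_cm = 100 * lengths_m                  # the same measurements in centimeters

cv_m = coefficient_of_variation(lengths_m)
cv_cm = coefficient_of_variation(lengths_cm)
cv_method2 = np.std(lengths_m / lengths_m.mean())   # sd of the rescaled variable

print("sd in m:", lengths_m.std(), " sd in cm:", lengths_cm.std())
print("cv in m:", cv_m, " cv in cm:", cv_cm, " sd(Y / mean):", cv_method2)
```

The standard deviation changes by a factor of 100 with the units, while the coefficient of variation stays the same. The second method agrees with the first here because the mean is positive; in general $\text{sd}(Y/c) = \text{sd}(Y)/|c|$.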





In [None]:
print()
print('Coefficient of variation of the estimators:')
print('CV for mean length:', np.std(means)/np.mean(means))
print('CV for mean log-length:', np.std(means_log)/np.mean(means_log))
print()
print('Coefficient of variation of the estimators, method 2:')
# Yup, it's that simple - we just change the location of one bracket.
# But in more advanced applications it's worth remembering this property of the std.
print('CV for mean length:', np.std(means/np.mean(means)))
print('CV for mean log-length:', np.std(means_log/np.mean(means_log)))
print()
print('Relative RMSE of the estimators:')
print('Relative RMSE for mean length:', np.sqrt(np.mean((means - true_mean)**2))/true_mean)
print('Relative RMSE for mean log-length:', np.sqrt(np.mean((means_log - true_mean_log)**2))/true_mean_log)

## Interval estimation

In the previous section, we've learned how to quantify and analyze the uncertainty of an estimation by analyzing the standard deviation of the estimator. In this section, we will learn a different technique - the estimation of *confidence intervals*, i.e. intervals which are likely to contain the true value of the parameter of interest.

In general, we say that a confidence interval $[A, B]$ for a parameter $\theta$ has a confidence level $\alpha$ if it contains the true value of the parameter $\theta$ with probability $\alpha$:

$$\mathbb{P}(A \leq \theta \leq B) = \alpha$$

Above, $A$ and $B$ are random variables calculated from the data (i.e. $A$ and $B$ are *statistics*). Note: some authors use a different terminology and would call this a level $1-\alpha$; some authors also use a more general definition with $\mathbb{P} \geq \alpha$ instead of $\mathbb{P} = \alpha$, because we often can't determine the exact probability and can only give its lower bound (you've seen this in the lecture with the Chebyshev confidence intervals).

In principle, we can construct confidence intervals for any parameter of any distribution (e.g. the expected value, the variance, the proportion in the Bernoulli distribution, the shape parameter of the Gamma distribution etc.), but this is often difficult in practice. We'll focus on confidence intervals for the true mean (i.e. the expected value) of a normally distributed population - this is one of the most commonly used and one of the most useful confidence intervals.

For the expected value of a normally distributed random variable, there are two commonly used confidence intervals: the confidence interval for a *known* $\sigma$, given by the equation

$$\left (\hat{\mu} - q_{(1+\alpha)/2}\frac{\sigma}{\sqrt{N}},\quad \hat{\mu} + q_{(1+\alpha)/2}\frac{\sigma}{\sqrt{N}} \right ), $$
where $q_{(1+\alpha)/2}$ is the quantile of the standard normal distribution at the level of $(1+\alpha)/2$; and the confidence interval for an *unknown* $\sigma$, given by the equation

$$\left (\hat{\mu} - t_{(1+\alpha)/2, N-1}\frac{\hat{\sigma}}{\sqrt{N}},\quad \hat{\mu} + t_{(1+\alpha)/2, N-1}\frac{\hat{\sigma}}{\sqrt{N}} \right ), $$
where $t_{(1+\alpha)/2, N-1}$ is the quantile of the Student's $t$ distribution with $N-1$ degrees of freedom at the level of $(1+\alpha)/2$, and $\hat{\sigma}$ is the square root of the **unbiased** estimator of the variance, i.e. $\hat{\sigma} = \sqrt{\sum_{i=1}^N (X_i - \bar{X})^2/(N-1)}$, where $\bar{X} = \hat{\mu} = \frac{1}{N}\sum_{i=1}^N X_i$.

If we simply plug in $\hat{\sigma}$ instead of $\sigma$ in the first kind of the confidence interval (the one for a known $\sigma$), we get a third type of confidence interval, a so-called *asymptotic confidence interval* for the mean; the name *asymptotic* comes from the fact that $\hat{\sigma} \to \sigma$ as $N \to \infty$. As a consequence of this convergence, the asymptotic confidence interval gives quite accurate results for large sample sizes.  
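
The three formulas above can be sketched as a single helper. This is an illustration on artificial normal data, not the solution to the exercise below; the function name `mean_confidence_intervals` and its arguments are assumptions made for this example.

```python
import numpy as np
from scipy.stats import norm, t as student_t

def mean_confidence_intervals(sample, alpha=0.95, sigma=None):
    """Return a dict of (lower, upper) confidence intervals for the mean of `sample`."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    mu_hat = sample.mean()
    sigma_hat = sample.std(ddof=1)        # square root of the unbiased variance estimator
    level = (1 + alpha) / 2
    q = norm.ppf(level)                   # standard normal quantile
    t_q = student_t.ppf(level, n - 1)     # Student's t quantile with n - 1 degrees of freedom
    radii = {
        "student": t_q * sigma_hat / np.sqrt(n),
        "asymptotic": q * sigma_hat / np.sqrt(n),
    }
    if sigma is not None:                 # only available when sigma is known
        radii["known_sigma"] = q * sigma / np.sqrt(n)
    return {name: (mu_hat - r, mu_hat + r) for name, r in radii.items()}

rng = np.random.default_rng(5)
sample_ci = rng.normal(200.0, 12.0, size=20)
intervals_demo = mean_confidence_intervals(sample_ci, alpha=0.95, sigma=12.0)
for name, (low, high) in intervals_demo.items():
    print(f"{name:12s} [{low:.2f}, {high:.2f}]  length = {high - low:.2f}")
```

For the same sample, the Student interval is always wider than the asymptotic one, because the $t$ quantile exceeds the normal quantile at the same level.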

**Exercise 10.** In this exercise, we'll do an empirical comparison of the properties of the three types of confidence intervals using the protein log-length data. We'll sample $R$ samples of some size $N$, calculate the corresponding confidence intervals for the mean, and check whether they have the desired confidence level and compare their lengths.   

First, we'll prepare our data.    
1. Create empty lists (or `numpy arrays`) that will contain the information whether the true mean is within a confidence interval (e.g. `within_normal` for the confidence interval with a known $\sigma$, `within_student` for the confidence interval with an unknown $\sigma$, `within_asymptotic` for the asymptotic confidence interval).   
2. Create empty lists (or `numpy arrays`) that will contain the lengths of the intervals.    
3. Repeat the following $R=1000$ times (or more):  
  3.1. Select a random sample of size $N$ of protein log-lengths; select $N$ of your choice.   
  3.2. Calculate the three confidence intervals at the 95% confidence level. For the quantiles, you can use the `norm.ppf` and `t.ppf` functions from the `scipy.stats` package (in this notebook, `t` is imported as `tstud`). For the normal confidence interval, use the known standard deviation in `true_std_log`. Pay attention to the type of the estimator of standard deviation that you use! Some packages use the unbiased estimator of the variance, some don't!  
  3.3. Calculate the lengths of the confidence intervals and append them to the corresponding lists.
  3.4. Check whether the confidence intervals contain the true average log-length $\mu_{\log(X)}$, append the information to the corresponding lists.  

Now, we'll use the generated data to analyze the properties of the confidence intervals.

4. For each type of the confidence interval, estimate the probability that it contains $\mu_{\log(X)}$. Is the estimated probability close to the desired confidence level for each type? Why/why not? Does the answer depend on $N$?  
5. Calculate the average length of each type of the confidence interval. Which type tends to give the shortest intervals? Which type tends to give the longest? Why?    
6. Plot histograms depicting the distribution of the lengths of the confidence intervals.
7. What are the advantages and disadvantages of each type of the confidence interval? Does the asymptotic confidence interval have any advantages over the other two?   
8. *Quick question.* For a single sample, do all of the three types of confidence intervals always contain $\hat{\mu}_{\log(X)}$?  

Repeat this exercise using the protein lengths instead of log-lengths. What went wrong and why? Does the answer depend on $N$?      
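
One ingredient of the dependence on $N$ can be inspected directly: the Student's $t$ quantile is larger than the normal quantile for every sample size and approaches it as $N$ grows. A short sketch (the chosen sample sizes are illustrative):

```python
from scipy.stats import norm, t as student_t

alpha_demo = 0.95
level_demo = (1 + alpha_demo) / 2
q_normal = norm.ppf(level_demo)

# Student's t quantiles with N - 1 degrees of freedom for a few sample sizes
t_quantiles = {n: student_t.ppf(level_demo, n - 1) for n in [5, 20, 100, 1000]}

print("Normal quantile:", q_normal)
for n, t_q in t_quantiles.items():
    print(f"N = {n:4d}: t quantile = {t_q:.4f}, ratio to normal = {t_q / q_normal:.4f}")
```

The ratio is also the ratio of the lengths of the Student and asymptotic intervals computed from the same sample.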


In [None]:
# Setting parameters
alpha = 0.95
R = 2000
N = 20
sqrtN = np.sqrt(N)
q = norm.ppf((1+alpha)/2)
t = tstud.ppf((1+alpha)/2, N-1)

# Generating the data
normal_radius = q*true_std_log/sqrtN  # this is non-random; change here to "true std" for protein lengths
within_normal = []
within_student = []
within_asymptotic = []
length_student = []
length_asymptotic = []
for _ in range(R):
  sample = human_protein_lengths['LogLength'].sample(N)  # change here to 'Protein length' for protein lengths
  mdiff = abs(sample.mean() - true_mean_log)  # change here to "true mean" for protein lengths
  std_log = sample.std() # pandas uses the N-1 normalization

  student_radius = t*std_log/sqrtN
  asymptotic_radius = q*std_log/sqrtN

  length_student.append(2*student_radius)
  length_asymptotic.append(2*asymptotic_radius)

  within_normal.append(mdiff <= normal_radius)
  within_student.append(mdiff <= student_radius)
  within_asymptotic.append(mdiff <= asymptotic_radius)

# Analyze the confidence intervals
print('Estimated confidence levels:')
print('Gaussian confidence interval:', np.mean(within_normal))
print('Student confidence interval:', np.mean(within_student))
print('Asymptotic confidence interval:', np.mean(within_asymptotic))
print()
print('Average lengths:')
print('Gaussian confidence interval:', 2*normal_radius)
print('Student confidence interval:', np.mean(length_student))
print('Asymptotic confidence interval:', np.mean(length_asymptotic))

px.histogram(x=length_student, title="Student confidence interval length").show()
px.histogram(x=length_asymptotic, title="Asymptotic confidence interval length").show()



**Exercise 11 -- optional homework.** The length of the confidence interval is another (and more common) way to determine the required sample size. In this exercise, we'll see how to use it. As in exercise 6, we'll focus on the case of a known standard deviation.  

1. Using the formula for the confidence interval with a known standard deviation, derive a formula for a necessary sample size $N^*$ such that the length of the confidence interval is at most some value $l$.  
2. Calculate the required sample size for the average log-length for $l = 0.3$ (approximately 10% of the true mean).   
3. Select a sample of the size calculated in the previous point and calculate an example confidence interval. Check if its length really is at most $l$.  
  3.1. *Quick question 1.* Is the length of this confidence interval a random variable? Does it depend on the random sample?     
4. *Quick question 2.* Do your calculations work for protein lengths as well?  
5. Take a look at the formula for the confidence interval for an unknown standard deviation. Can you see why it may be difficult to derive a formula for $N^*$ in this case?  
  5.1.\* How would you approach calculating $N^*$ in this case?     

Answer to 1): We need to solve $2q_{(1+\alpha)/2} \sigma/\sqrt{N^*} = l$, this gives us $N^* = 4q_{(1+\alpha)/2}^2\sigma^2/l^2$.

Answer to 3.1): No. The length is given by
$$2q_{(1+\alpha)/2}\frac{\sigma}{\sqrt{N}},$$
where $\sigma$ is the true, known value of the standard deviation, so it does not depend on the sample.

Answer to 4): They don't work for small samples, which is the use case for this kind of equation, so it basically doesn't work at all.   

Answer to 5): Because we'd need to invert the Student's t distribution quantile function.

Answer to 5.1): This can be done numerically.  
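
One way the numerical approach from 5.1 could look is sketched below. With an unknown $\sigma$ the interval length is random, so the sketch assumes a fixed guess `sigma_guess` (e.g. from a pilot sample) in place of $\hat{\sigma}$; the function name and its arguments are hypothetical.

```python
import math
from scipy.stats import norm, t as student_t

def required_sample_size_student(sigma_guess, max_length, alpha=0.95, n_max=1_000_000):
    """Smallest N with 2 * t_{(1+alpha)/2, N-1} * sigma_guess / sqrt(N) <= max_length."""
    level = (1 + alpha) / 2
    # The normal-based formula gives a lower bound, because the t quantile is larger
    n_start = max(2, math.ceil(4 * norm.ppf(level) ** 2 * sigma_guess ** 2 / max_length ** 2))
    for n in range(n_start, n_max + 1):
        length = 2 * student_t.ppf(level, n - 1) * sigma_guess / math.sqrt(n)
        if length <= max_length:
            return n
    raise ValueError("No sample size up to n_max gives the requested length")

sigma_guess_demo, max_length_demo = 0.3, 0.3   # illustrative values
n_normal = math.ceil(4 * norm.ppf(0.975) ** 2 * sigma_guess_demo ** 2 / max_length_demo ** 2)
n_student = required_sample_size_student(sigma_guess_demo, max_length_demo)
print("Sample size from the normal formula:", n_normal)
print("Sample size from the Student formula:", n_student)
```

The search is a simple linear scan, which works because the interval length decreases with $N$ for a fixed `sigma_guess`.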

In [None]:
import math

alpha = 0.95
q = norm.ppf((1+alpha)/2)
## Get the sample size such that the length of the confidence interval
## is at most 0.3, i.e., approx. 0.1 of the true mean:
c = 0.3/true_mean_log
N = math.ceil(4*q**2*true_std_log**2/(c**2*true_mean_log**2))
print('We need', N, 'samples for a confidence interval of length at most', c*true_mean_log)
sample = human_protein_lengths['LogLength'].sample(N)
mu = np.mean(sample)
s = sample.std()  # pandas uses the N-1 normalization (unbiased variance estimator)
conf = [mu - q/np.sqrt(N)*true_std_log, mu + q/np.sqrt(N)*true_std_log]
print('Example confidence interval from', N, 'samples:', conf)
print('Actual length:', conf[1]-conf[0])  # this does not depend on the sample, because sigma is known
conf_asymptotic = [mu - q/np.sqrt(N)*s, mu + q/np.sqrt(N)*s]
print('Example asymptotic confidence interval from', N, 'samples:', conf_asymptotic)
print('Length of the corresponding asymptotic confidence interval:', conf_asymptotic[1] - conf_asymptotic[0])

<center><img src='https://drive.google.com/uc?id=1_utx_ZGclmCwNttSe40kYA6VHzNocdET' height="60">

AI TECH - Academy of Innovative Applications of Digital Technologies. Operational Programme Digital Poland for 2014-2020
<hr>

<img src='https://drive.google.com/uc?id=1BXZ0u3562N_MqCLcekI-Ens77Kk4LpPm'>


Project co-financed by the European Union from the European Regional Development Fund
Operational Programme Digital Poland for 2014-2020,
Priority Axis 3 "Digital competences of society", Measure 3.2 "Innovative solutions for digital activation".   
Project title: "Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)"
    </center>