# Statistical machine learning - Notebook 2, version for students
**Authors: Michał Ciach, Grzegorz Preibisch, Dorota Celińska-Kopczyńska**  


In today's class, we will learn some aspects of parameter and interval estimation.
In the first section, we will focus on the graphical analysis of the properties of the point estimator (e.g., mean value of a distribution) using artificial data.
In the second section, we will focus on estimating the mean value of a distribution using a statistical sample.   
In the last section, we'll calculate confidence intervals for the mean using real-world data.  


In [None]:
!pip install gdown



In [None]:
!gdown https://drive.google.com/uc?id=1xOJfD-jexDbHSOCg1EiyAxqc5kXjMvX0

Downloading...
From: https://drive.google.com/uc?id=1xOJfD-jexDbHSOCg1EiyAxqc5kXjMvX0
To: /content/protein_lengths.tsv


## Data & library imports

In [None]:
import pandas as pd
import numpy as np
import numpy.random as rd
import plotly.express as px
from scipy.stats import norm
from scipy.stats import t as tstud
from typing import Callable, List
import plotly.graph_objects as go

In [None]:
protein_lengths = pd.read_csv('protein_lengths.tsv', sep='\t')
protein_lengths

Unnamed: 0,Scientific name,Common name,Protein ID,Protein length
0,Homo sapiens,Human,NP_000005.3,1474
1,Homo sapiens,Human,NP_000006.2,290
2,Homo sapiens,Human,NP_000007.1,421
3,Homo sapiens,Human,NP_000008.1,412
4,Homo sapiens,Human,NP_000009.1,655
...,...,...,...,...
648731,Imleria badia,Bay bolete (mushroom),KAF8560453.1,494
648732,Imleria badia,Bay bolete (mushroom),KAF8560454.1,737
648733,Imleria badia,Bay bolete (mushroom),KAF8560455.1,554
648734,Imleria badia,Bay bolete (mushroom),KAF8560456.1,813


## Point estimation -- simulation approach

During the lecture, we introduced the properties of the estimator. To develop more intuition about them, in this notebook we will graphically analyze their distributions (and the distributions of some relevant statistics).

Contrary to other notebooks, in this one we will start with analyzing artificial data. The advantage of (pseudo)randomly generated data is that we are able to control nearly every aspect of the study. E.g., if we want to analyze the properties of an estimator, we need to know the true value of the parameter it estimates (the ground truth, i.e. the value we know we should obtain). In real-world data we rarely know the true values. Using (pseudo)randomly generated data allows us to set the true values as the parameters of the generation mechanism.

**Exercise 1.** In this exercise we will analyze the distribution of a few estimators (for normally distributed data) from repeated experiments (the number of repetitions will be $N$). The size of the sample ($n$) will be the same in each experiment. To this end:

* Write a function that will return $N$ samples of size $n$ from a Gaussian distribution with a given mean $\mu$ and variance $\sigma^2$. Run the function with parameters: $N=1000$, $n = 30$, $\mu = 200$, $\sigma^2=144$.
* Based on the dataset from the previous point, for each sample of size $n$ compute the following estimators:

 - mean $\hat{X} = \frac{1}{n} \sum_{i=1}^n X_i$
 - unbiased estimator for variance $\tilde{S}_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i-\hat{X})^2$
 - biased estimator for variance $S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i-\hat{X})^2$
 - estimator for variance with minimal MSE $S_n^{2*} = \frac{1}{n+1} \sum_{i=1}^n (X_i-\hat{X})^2$

Hint: For that purpose, we suggest writing a function and changing the functions for the estimators when necessary. For variance estimators, make the fullest use of `np.var`.

* Create the histograms for the values of the computed estimators. Plot the values of the true parameters ($\mu = 200$, $\sigma^2=144$ ). What can you say about the distribution of the estimators? Are there differences in the shape of the distributions of the estimators? Discuss and justify your view.
* Compute the biases, variances, and MSE of the estimators. Do the results agree with the theoretical results?
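
The sampling step and the variance estimators from this exercise can be sketched as follows. This is only a minimal sketch under stated assumptions, not a reference solution: the function name `draw_samples` and its `seed` argument are made up for illustration, and `np.random.default_rng` is used instead of the `rd` alias imported above. Note that `np.var` divides by `n - ddof`, so `ddof=1` gives the unbiased estimator and `ddof=0` the biased one.

```python
import numpy as np

def draw_samples(N, n, mu, sigma2, seed=None):
    """Return an (N, n) array: N samples of size n from N(mu, sigma2)."""
    rng = np.random.default_rng(seed)
    # rng.normal expects the standard deviation, not the variance
    return rng.normal(loc=mu, scale=np.sqrt(sigma2), size=(N, n))

samples = draw_samples(N=1000, n=30, mu=200, sigma2=144, seed=0)
n = samples.shape[1]

means = samples.mean(axis=1)                 # one mean per sample (row)
var_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1
var_biased = samples.var(axis=1, ddof=0)     # divides by n
var_min_mse = var_biased * n / (n + 1)       # rescaled so that it divides by n + 1

# Averages of the estimators over the N repeated experiments
print(means.mean(), var_unbiased.mean(), var_biased.mean(), var_min_mse.mean())
```

Each of the four arrays holds $N$ values of one estimator, so they can be passed directly to a histogram function or used to estimate bias, variance, and MSE.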

**Exercise 2. (optional)**  In the previous exercise, we worked with the distribution of the estimators by running multiple experiments with a given sample size (by default quite small). In this exercise, we will analyze the asymptotic properties of the estimators. To this end:

* Compute 100 samples of each size from 2 to 5000 (you may keep using data from N(200,144)). For each sample compute:

 - $\hat{X} = \frac{1}{n} \sum_{i=1}^n X_i$ as a mean estimator
 - $\hat{X} +10$ as a mean estimator
 - unbiased estimator for variance $\tilde{S}_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i-\hat{X})^2$
 - biased estimator for variance $S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i-\hat{X})^2$

* Scatterplot the obtained values of the parameters against the sample size: the y axis should be the value of the parameter, and the x axis should be the size of the sample. Add a horizontal line with the true value of the parameter. Compare the plots. Do values of the estimators become closer to real values when the size of the sample increases? Do the results agree with the results on consistency of the estimators? Discuss.

* Compute the biases of the estimators. Plot the biases of the estimators against the sample size. Add a horizontal line at zero. Compare the insights from the plots with the theory of asymptotic unbiasedness.

Hint: We encourage you to use scatterplots, but visualizations of distributions may also come in handy.
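
As a rough numerical illustration of asymptotic unbiasedness, the sketch below estimates the bias of both variance estimators for a handful of sample sizes and prints it next to the theoretical bias of the biased estimator, $-\sigma^2/n$ (which follows from $\mathbb{E}(S_n^2) = \frac{n-1}{n}\sigma^2$). The number of repetitions `reps`, the chosen sample sizes, and the `results` dictionary are assumptions made for this example; the exercise itself asks for 100 samples per size and for plots.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 200.0, 144.0
reps = 20000  # hypothetical number of repetitions per sample size

results = {}
for n in (2, 5, 20, 100):
    x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    bias_biased = x.var(axis=1, ddof=0).mean() - sigma2     # estimated bias of S_n^2
    bias_unbiased = x.var(axis=1, ddof=1).mean() - sigma2   # estimated bias of the unbiased estimator
    results[n] = (bias_biased, bias_unbiased)
    # columns: n, estimated bias (biased), theoretical bias (biased), estimated bias (unbiased)
    print(n, round(bias_biased, 2), round(-sigma2 / n, 2), round(bias_unbiased, 2))
```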

**Exercise 3 -- optional homework.** Find an MLE estimator for $\lambda$ in exponential distribution (probability density function  $f(x) = \lambda \exp(-\lambda x)$ for $x \in [0, \infty)$), using pen and paper.
Prepare a similar analysis of the properties of the found estimator as the analyses in Exercises 1 and 2 (analysis of the distribution with a given sample size and the analysis of the asymptotic properties based on the increasing sample sizes). You may assume true $\lambda = 2$. Discuss your results.

## Exploratory analysis
The first step to any statistical analysis is to explore the data - check the basic statistics like the mean and variance, and visualize the data to see what kind of distribution we're dealing with.

**Exercise 4.** In this exercise, we'll extract the data about human proteins, perform a simple data transformation, and do a basic exploratory analysis.

1. Select the data about human protein lengths from the `protein_lengths` data frame, and put it into a data frame `human_protein_lengths`. Here, you may need to use the `.copy()` method for the subsequent steps to work (ask your tutor if you need a further explanation).
2. Calculate the base-10 logarithm of the protein length and append it to the `human_protein_lengths` data frame as a column called `LogLength`.
3. Use the `human_protein_lengths.describe()` method to check the basic statistics of the numerical columns of the data frame. What is the average length of a human protein? What is the maximum length?  
4. Use the `px.histogram()` function to create histograms showing the distributions of the protein lengths and log-lengths. Which distribution is more spread around its average? Does any distribution resemble the Normal (Gaussian) distribution? Are there many proteins with lengths similar to the maximum length, or just a few?  
5. Calculate the average length and log-length and their standard deviations; Store them in variables `true_mean`, `true_mean_log`, `true_std` and `true_std_log`. We'll use them in subsequent exercises as our *ground truth* against which we'll evaluate our estimators.     
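
The selection and transformation steps can be sketched on a toy data frame. The rows below are made up; only the column names `Common name` and `Protein length` are taken from the table printed above, and the names `toy` and `human_toy` are placeholders for this example. The sketch also shows why `.copy()` helps: without it, adding a column to a filtered frame may trigger the pandas `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `protein_lengths` (made-up rows, same column names)
toy = pd.DataFrame({
    "Common name": ["Human", "Human", "Bay bolete (mushroom)"],
    "Protein length": [100, 1000, 494],
})

# .copy() creates an independent data frame, so the new column below
# is added to the copy and the original `toy` frame is left untouched
human_toy = toy[toy["Common name"] == "Human"].copy()
human_toy["LogLength"] = np.log10(human_toy["Protein length"])

print(human_toy)
```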

*Quick question*. Is `true_mean` equal to `10**true_mean_log`? Why/why not? (note that we've used the base-10 logarithm)

*Why the base-10 logarithm?* Mathematicians usually prefer to use natural logarithms. In statistical data analysis, we sometimes also use base-10 logarithms, because their values are easier to interpret as orders of magnitude (or simply the numbers of digits) of our values. Although the logarithms are mostly equivalent mathematically, an easier interpretation is important to get more meaningful conclusions from the data.

*Why the standard deviation?* Some students may wonder why we prefer the standard deviation rather than the variance - after all, the difference is just a square root, so mathematically it's almost the same thing. Here, again, the reason is the interpretability of the results. When we compute the variance, we square the observations. As a consequence, their units also get squared. This means that, if we estimate the number of mushrooms in a forest, the variance is expressed in terms of *mushrooms squared*, which doesn't make any sense. Taking a square root brings the unit back to mushrooms.   

## Point estimation

One of the main strengths of statistical theory is that it allows us to estimate many quantities (like the mean protein length, mean income in a country, or voting preferences) using only a sample of randomly selected observations, and, most importantly, to estimate the uncertainty of such estimation. This is the main reason why we derive properties of estimators, such as their expected value and variance. Good statisticians can derive estimators which need fewer observations and give better results.   

**Exercise 5.** In this exercise, we'll do an empirical analysis of the properties of the estimator of the mean. We'll use a sample of $N=1000$ randomly selected human proteins. Denote $X_i$ as the length of a randomly selected human protein, and $\log(X_i)$ as its base-10 logarithm. Define the following two estimators:

$$\hat{\mu}_X = \sum_{i=1}^N X_i/N, \text{an estimator of the mean length}$$

$$\hat{\mu}_{\log(X)} = \sum_{i=1}^N \log(X_i)/N, \text{an estimator of the mean log-length}$$  

First, we'll draw $R=2000$ independent samples and calculate the estimators. Here's an example way to do this:   

1. Create empty lists called e.g. `means` and `means_log`.  
2. Repeat the following $R$ times (e.g. using a `for` loop):  
    2.1. Get a random sample of size $N$ of the observations (i.e. rows) from `human_protein_lengths`; you can use the `.sample()` method.   
    2.2. Calculate the mean length and append to `means`  
    2.3. Calculate the mean log-length and append to `means_log`.     
3. Convert both lists to `numpy` arrays (e.g. `means = np.array(means)`)

Now, we can inspect how well the estimators approximate the *true* mean length $\mu_X$ and the *true* mean log-length $\mu_{\log{X}}$ (notice the lack of hats above $\mu$'s - this means that these are the true parameters, not estimators).

4. Estimate the mean value of the estimator of the mean (by running `np.mean(means)`). Is it close to the true value $\mu_X$? In other words, does the estimator seem *unbiased*?
5. Estimate the bias of the estimator of the mean log-length (using the values in `means_log`). Does it seem biased? Does the result agree with the theoretical one about the estimator of the mean?       
6. Estimate the Root Mean Square Error of the estimator of the mean, defined as $\text{RMSE}(\hat{\mu}_X) = \sqrt{\mathbb{E}(\hat{\mu}_X - \mu_X)^2}$. This will tell you, approximately, the average error of $\hat{\mu}_X$ in terms of the number of amino acids (the building blocks of proteins). Why did I write *approximately*? (Hint: it's not just because we estimate it rather than calculate it theoretically)
7. Estimate $\text{RMSE}(\hat{\mu}_{\log(X)})$. How can you interpret the result?
8. Estimate the standard deviations of the estimators. Which one is less variable? Does it mean that one quantity is easier to estimate than the other?
9. Is $\text{sd}(\hat{\mu}_X) = \text{RMSE}(\hat{\mu}_X)$? Why/why not? Is the equation always true, sometimes true, or never true?    
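
The repeated-sampling procedure from this exercise can be sketched as below for the mean length only. To keep the example self-contained it uses a synthetic, right-skewed population instead of `human_protein_lengths`; the lognormal parameters and the population size are arbitrary assumptions, and the loop index is used as `random_state` only to make the run reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic stand-in for `human_protein_lengths` (made-up, right-skewed data)
population = pd.DataFrame({"Protein length": rng.lognormal(6.0, 0.7, size=20000)})
true_mean = population["Protein length"].mean()

N, R = 1000, 2000
means = []
for r in range(R):
    # .sample() draws N rows without replacement by default
    sample = population.sample(n=N, random_state=r)
    means.append(sample["Protein length"].mean())
means = np.array(means)

bias = means.mean() - true_mean                       # estimated bias
rmse = np.sqrt(np.mean((means - true_mean) ** 2))     # estimated RMSE
sd = means.std()                                      # estimated standard deviation
print(f"bias={bias:.3f}, RMSE={rmse:.3f}, sd={sd:.3f}")
```

The same loop works for the log-length once a `LogLength` column is available; only the column name and the true value change.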


**Exercise 6.** The standard deviation of an unbiased estimator tells us how much it fluctuates around the true value. An estimator with a lower standard deviation will more often give us values that are close to the true one. This way, we can compare two estimators of the same thing.  

However, the standard deviation is often not that useful when comparing the measurements of two different things. This is because it depends on the units of the measurement. Suppose we measure the length $L$ of some objects in meters, and the standard deviation of the measurement is $\text{sd}(L)$. Measuring the same object in centimeters will give us a measurement equal to $100L$, and the corresponding standard deviation $\text{sd}(100L) = 100\text{sd}(L)$ will appear to be much larger, but it doesn't mean that measurements in centimeters are more difficult. To make matters worse, in real-life applications, the variability of the measurement often depends on its average value, regardless of the units. The standard deviation of the height of a mouse (a few millimeters) is much lower than that of an elephant (several centimeters), but it doesn't mean that mice are easier to measure. Similarly, in the case of protein length and log-length, the latter is much smaller, so it can be expected that its standard deviation will be smaller as well.

To evaluate the variability of an estimator regardless of its units and the average value, we can calculate a so-called [*coefficient of variation*](https://en.wikipedia.org/wiki/Coefficient_of_variation) (variation, not variance!). For a random value $Y$, this is defined as $\text{cv}(Y) = \text{sd}(Y)/\mathbb{E}(Y)$.   
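
The unit argument above can be checked numerically. In the sketch below, the helper `cv` and the simulated measurements are assumptions made for illustration: changing the unit from meters to centimeters multiplies the standard deviation by 100 but leaves the coefficient of variation unchanged.

```python
import numpy as np

def cv(values):
    """Empirical coefficient of variation: standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

rng = np.random.default_rng(3)
length_m = rng.normal(loc=2.0, scale=0.1, size=10_000)  # made-up lengths in meters
length_cm = 100 * length_m                               # the same objects in centimeters

print(length_m.std(), length_cm.std())  # standard deviations differ by a factor of 100
print(cv(length_m), cv(length_cm))      # coefficients of variation agree up to rounding
```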

1. Calculate the coefficients of variation for the estimators of mean protein length, i.e. $\text{cv}(\hat{\mu}_X)$, and log-length, i.e. $\text{cv}(\hat{\mu}_{\log(X)})$. Which estimator is better in this case? In general, does a lower coefficient of variation always mean that an estimator is better?
2. Is $\text{cv}(Y)$ always equal to $\text{sd}(Y/\mathbb{E}(Y))$? Is there a condition for $Y$ that makes it equal? Give an analytical argument and verify that empirically on the protein length data.





**Exercise 7.** Remember that the estimators are random variables with their own distributions! We can explore their distributions visually.  

1. Use the `px.histogram()` function to generate histograms of the estimator of the mean $\hat{\mu}_X$ and the log-mean $\hat{\mu}_{\log(X)}$.
2. Annotate the histograms with the true values $\mu_X$ and $\mu_{\log{X}}$ that you have computed in Exercise 4. You can do the annotation any way you want, I typically use a red dot at the bottom of the histogram or a vertical line. Are the estimators centered around the true values? Which estimator is more focused (i.e. less spread) around the true value?
3. Annotate the histograms with the average values of the estimators that you have computed in Exercise 5. Use different colors than in the previous point.
4. Is the distribution of the estimator $\hat{\mu}_{\log(X)}$ similar to the distribution of the protein log-lengths that you visualized in Exercise 4?
5. Is the distribution of the estimator $\hat{\mu}_X$ similar to the distribution of the protein lengths that you visualized in Exercise 4? Or maybe it's more similar to the normal distribution now? Why?


**Exercise 8.** We've learned how to analyze the properties of an estimator for a fixed sample size $N$, and we can use this knowledge to do something even more useful: determining how many observations we need for an estimation with a given precision.

1. Using the equations for the expected value and the standard deviation of the estimator of the mean, $\hat{\mu}_X = \sum_{i=1}^N X_i/N$ where $X_i$ is the length of a randomly selected protein, calculate the sample size $N^*$ that we need to take in order for the standard deviation of the estimator to be equal to a fraction $p$ of the true mean (i.e. $\sigma_{\hat{\mu}} = p\mu$; note the lack of a hat above sigma - it's not a random variable, but a parameter of the estimator). Express it in terms of coefficients of variation. You will need to assume that the standard deviation of the protein length is known; use the value in the `true_std` variable from Exercise 4.
2. Calculate $N^*$ for the estimator of the average log-length (use the true standard deviation of protein log-lengths in the `true_std_log` variable). Is there a noticeable difference compared to the estimator of the average length? Which quantity is easier to estimate?   
3. Analyze one of the estimators for a sample of size of the corresponding $N^*$: visualize its distribution, estimate its bias, standard deviation, coefficient of variation and RMSE. You can simply modify the code from the previous exercises.   
  3.1 Use the results to verify if your calculation of $N^*$ was correct.  
  3.2 How did the distribution of the estimator change compared to the previous sample size ($N = 1000$)?  
4. *Quick question 1*. Does $N^*$ depend on the number of proteins that humans have? In order to get the same precision of the estimation (measured in terms of the RMSE), would you need a larger sample size if humans had a million proteins?  
5. *Quick question 2.* Does $N^*$ depend on the distribution of the data?


*Note.* In practice, the estimation of $N^*$ is often more difficult, because we usually don't know the true standard deviation; instead, we need to estimate it and take into account the error of this estimation when deriving $N^*$. Because of this, the required sample size will typically be larger than in the case of a known standard deviation. This topic is too complex to cover in this course, so we'll focus on the simpler case with a known standard deviation.


**Exercise 9.** Can we use the estimated average log-length to estimate the average length of human proteins?  

1. Consider a statistic given by the equation $\hat{\zeta} = 10^{\hat{\mu}_{\log(X)}}$. Is it an estimator of the average protein length?  
  1.1 Does $\hat{\zeta}$ correspond to some well-known mathematical object, e.g. some kind of mean?
2. Regardless of the answer, let's try to use $\hat{\zeta}$ to estimate the average protein length. Use the randomly sampled values of $\hat{\mu}_{\log(X)}$ from the previous exercises to calculate the corresponding values of $\hat{\zeta}$ and to estimate this estimator's expected value and standard deviation. You can use a sample size of your choice; try to compare the results for different sample sizes, like $N = 10, 100, 1000$.  
3. Based on the results, do you think that $\hat{\zeta}$ is an unbiased estimator of the average length? Try to confirm your expectations by deriving formulas for the expected value of $\hat{\zeta}$.  
  3.1.\* If $\hat{\zeta}$ is biased, then how does the bias scale with the number of observations? Check this either theoretically by analyzing equations or empirically by estimating the bias for different sample sizes (e.g. create a plot showing the estimated bias depending on the sample size).   
4. Does $\hat{\zeta}$ have a lower or a higher variance than $\hat{\mu}_{X}$? What about the coefficient of variation?   
5. Plot the values of the estimators $\hat{\zeta}$ and $\hat{\mu}_X$ on two boxplots side-by-side and annotate it with the true average protein length (for example, using a horizontal line). Which estimator seems better?  
6. Which estimator has a lower RMSE: $\hat{\zeta}$ or $\hat{\mu}_X$? Why?  
7.\* Do we have $\text{RMSE}(\hat{\zeta}) \geq \text{RMSE}(\hat{\mu}_X)$ for all sample sizes? If not, then try to characterize the sample sizes for which  $\hat{\zeta}$ works better than $\hat{\mu}_X$.  



## Interval estimation

In the previous section, we've learned how to quantify and analyze the uncertainty of an estimation by analyzing the standard deviation of the estimator. In this section, we will learn a different technique - the estimation of *confidence intervals*, i.e. intervals which are likely to contain the true value of the parameter of interest.

In general, we say that a confidence interval $[A, B]$ for a parameter $\theta$ has a confidence level $\alpha$ if it contains the true value of the parameter $\theta$ with probability $\alpha$:

$$\mathbb{P}(A \leq \theta \leq B) = \alpha$$

Above, $A$ and $B$ are random variables calculated from the data (i.e. $A$ and $B$ are *statistics*). Note: some authors use a different terminology and would call this a level $1-\alpha$; some authors also use a more general definition with $\mathbb{P} \geq \alpha$ instead of $\mathbb{P} = \alpha$, because we often can't determine the exact probability and can only give its lower bound (you've seen this in the lecture with the Chebyshev confidence intervals).

In principle, we can construct confidence intervals for any parameter of any distribution (e.g. the expected value, the variance, the proportion in the Bernoulli distribution, the shape parameter of the Gamma distribution etc.), but this is often difficult in practice. We'll focus on confidence intervals for the true mean (i.e. the expected value) of a normally distributed population - this is one of the most commonly used and one of the most useful confidence intervals.

For the expected value of a normally distributed random variable, there are two commonly used confidence intervals: the confidence interval for a *known* $\sigma$, given by the equation

$$\left (\hat{\mu} - q_{(1+\alpha)/2}\frac{\sigma}{\sqrt{N}},\quad \hat{\mu} + q_{(1+\alpha)/2}\frac{\sigma}{\sqrt{N}} \right ), $$
where $q_{(1+\alpha)/2}$ is the quantile of the standard normal distribution at the level of $(1+\alpha)/2$; and the confidence interval for an *unknown* $\sigma$, given by the equation

$$\left (\hat{\mu} - t_{(1+\alpha)/2, N-1}\frac{\hat{\sigma}}{\sqrt{N}},\quad \hat{\mu} + t_{(1+\alpha)/2, N-1}\frac{\hat{\sigma}}{\sqrt{N}} \right ), $$
where $t_{(1+\alpha)/2, N-1}$ is the quantile of the Student's $t$ distribution with $N-1$ degrees of freedom at the level of $(1+\alpha)/2$, and $\hat{\sigma}$ is the square root of the **unbiased** estimator of the variance, i.e. $\hat{\sigma} = \sqrt{\sum_{i=1}^N (X_i - \bar{X})^2/(N-1)}$, where $\bar{X} = \hat{\mu} = \sum_{i=1}^N X_i/N$.

If we simply plug in $\hat{\sigma}$ in place of $\sigma$ in the first kind of confidence interval (the one for a known $\sigma$), we get a third type of confidence interval, a so-called *asymptotic confidence interval* for the mean; the name *asymptotic* comes from the fact that $\hat{\sigma} \to \sigma$ as $N \to \infty$. As a consequence of this convergence, the asymptotic confidence interval gives quite accurate results for large sample sizes.  
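
The three formulas can be translated into code roughly as follows. This is a minimal sketch, not a reference implementation: the function name `mean_confidence_intervals`, its arguments, and the simulated data are assumptions made for illustration. Note `ddof=1` in `x.std`, which corresponds to the unbiased estimator of the variance.

```python
import numpy as np
from scipy.stats import norm
from scipy.stats import t as tstud

def mean_confidence_intervals(x, alpha=0.95, sigma=None):
    """Return a dict of (lower, upper) confidence intervals for the mean of x."""
    x = np.asarray(x, dtype=float)
    N = x.size
    mu_hat = x.mean()
    sigma_hat = x.std(ddof=1)                    # sqrt of the unbiased variance estimator
    q = norm.ppf((1 + alpha) / 2)                # standard normal quantile
    tq = tstud.ppf((1 + alpha) / 2, df=N - 1)    # Student's t quantile, N - 1 degrees of freedom

    intervals = {
        "student": (mu_hat - tq * sigma_hat / np.sqrt(N), mu_hat + tq * sigma_hat / np.sqrt(N)),
        "asymptotic": (mu_hat - q * sigma_hat / np.sqrt(N), mu_hat + q * sigma_hat / np.sqrt(N)),
    }
    if sigma is not None:                        # the known-sigma interval needs the true sigma
        intervals["normal"] = (mu_hat - q * sigma / np.sqrt(N), mu_hat + q * sigma / np.sqrt(N))
    return intervals

rng = np.random.default_rng(4)
x = rng.normal(loc=2.6, scale=0.3, size=25)      # made-up normally distributed data
for name, (lo, hi) in mean_confidence_intervals(x, sigma=0.3).items():
    print(f"{name:>10}: [{lo:.3f}, {hi:.3f}]")
```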

**Exercise 10.** In this exercise, we'll do an empirical comparison of the properties of the three types of confidence intervals using the protein log-length data. We'll sample $R$ samples of some size $N$, calculate the corresponding confidence intervals for the mean, and check whether they have the desired confidence level and compare their lengths.   

First, we'll prepare our data.    
1. Create empty lists (or `numpy` arrays) that will contain the information whether the true mean is within a confidence interval (e.g. `within_normal` for the confidence interval with a known $\sigma$, `within_student` for the confidence interval with an unknown $\sigma$, `within_asymptotic` for the asymptotic confidence interval).   
2. Create empty lists (or `numpy` arrays) that will contain the lengths of the intervals.    
3. Repeat the following $R=1000$ times (or more):  
  3.1. Select a random sample of size $N$ of protein log-lengths; select $N$ of your choice.   
  3.2. Calculate the three confidence intervals on the confidence level 95%. For the quantiles, you can use the `norm.ppf` and `t.ppf` functions from the `scipy.stats` package (the Student's $t$ distribution is imported above as `tstud`). For the normal confidence interval, use the known standard deviation in `true_std_log`. Pay attention to the type of the estimator of standard deviation that you use! Some packages use the unbiased estimator of the variance, some don't!  
  3.3. Calculate the lengths of the confidence intervals and append them to the corresponding lists.
  3.4. Check whether the confidence intervals contain the true average log-length $\mu_{\log(X)}$, append the information to the corresponding lists.  
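
The bookkeeping in steps 1-3 can be sketched as below for a single interval type (the one for an unknown $\sigma$). The sketch uses simulated normal data with made-up parameters `mu` and `sigma` rather than the protein log-lengths, and the list name `lengths_student` is an assumption; the other two interval types would follow the same pattern.

```python
import numpy as np
from scipy.stats import t as tstud

rng = np.random.default_rng(5)
mu, sigma = 2.6, 0.3          # made-up "true" parameters of a normal population
N, R, alpha = 20, 1000, 0.95

within_student = []           # does the interval contain the true mean?
lengths_student = []          # length of each interval
for _ in range(R):
    x = rng.normal(mu, sigma, size=N)
    # half-width of the Student interval; ddof=1 gives the unbiased variance estimator
    half = tstud.ppf((1 + alpha) / 2, df=N - 1) * x.std(ddof=1) / np.sqrt(N)
    lower, upper = x.mean() - half, x.mean() + half
    lengths_student.append(upper - lower)
    within_student.append(lower <= mu <= upper)

coverage = np.mean(within_student)
print(f"estimated coverage: {coverage:.3f}, mean length: {np.mean(lengths_student):.3f}")
```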

Now, we'll use the generated data to analyze the properties of the confidence intervals.

4. For each type of the confidence interval, estimate the probability that it contains $\mu_{\log(X)}$. Is the estimated probability close to the desired confidence level for each type? Why/why not? Does the answer depend on $N$?  
5. Calculate the average length of each type of the confidence interval. Which type tends to give the shortest intervals? Which type tends to give the longest? Why?    
6. Plot histograms depicting the distribution of the lengths of the confidence intervals.
7. What are the advantages and disadvantages of each type of the confidence interval? Does the asymptotic confidence interval have any advantages over the other two?   
8. *Quick question.* For a single sample, do all of the three types of confidence intervals always contain $\hat{\mu}_{\log(X)}$?  

Repeat this exercise using the protein lengths instead of log-lengths. What went wrong and why? Does the answer depend on $N$?      


**Exercise 11.** The length of the confidence interval is another (and more common) way to determine the required sample size. In this exercise, we'll see how to use it. As in Exercise 8, we'll focus on the case of a known standard deviation.  

1. Using the formula for the confidence interval with a known standard deviation, derive a formula for a necessary sample size $N^*$ such that the length of the confidence interval is at most some value $l$.  
2. Calculate the required sample size for the average log-length for $l = 0.3$ (approximately 10% of the true mean).   
3. Select a sample of the size calculated in the previous point and calculate an example confidence interval. Check if its length really is at most $l$.  
  3.1. *Quick question 1.* Is the length of this confidence interval a random variable? Does it depend on the random sample?     
4. *Quick question 2.* Do your calculations work for protein lengths as well?  
5. Take a look at the formula for the confidence interval for an unknown standard deviation. Can you see why it may be difficult to derive a formula for $N^*$ in this case?  
  5.1.\* How would you approach calculating $N^*$ in this case?     
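
A closed-form $N^*$ can be cross-checked by brute force. The sketch below evaluates the length of the known-$\sigma$ interval, $2 q_{(1+\alpha)/2}\,\sigma/\sqrt{N}$, for increasing $N$ and returns the first value for which the length is at most $l$. The function names and the value of `sigma` are hypothetical; in the exercise, the standard deviation would come from the data.

```python
import numpy as np
from scipy.stats import norm

def ci_length_known_sigma(N, sigma, alpha=0.95):
    """Length of the known-sigma confidence interval for the mean."""
    return 2 * norm.ppf((1 + alpha) / 2) * sigma / np.sqrt(N)

def smallest_sample_size(l, sigma, alpha=0.95, max_N=10**6):
    """Brute-force search for the smallest N whose interval length is at most l."""
    for N in range(1, max_N + 1):
        if ci_length_known_sigma(N, sigma, alpha) <= l:
            return N
    raise ValueError("no N <= max_N gives a short enough interval")

sigma = 0.3   # hypothetical known standard deviation, not the value from the data
N_star = smallest_sample_size(l=0.3, sigma=sigma)
print(N_star, ci_length_known_sigma(N_star, sigma))
```

Comparing the result of such a search with the value given by a derived formula is a quick way to catch algebra mistakes, e.g. a forgotten factor of 2 or a missing rounding up.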

**Exercise 12.\*** In this exercise, we'll see how the three types of the confidence intervals for the mean change with the sample size $N$. We'll also see some more details about their variability. We'll work on the protein log-length data.     

1. Create variables to store the lower and upper bounds of each of the three types of the confidence intervals for the mean on a selected confidence level.
2. For each value of $N$ from 3 to 100:  
  2.1. Draw a random sample of size $N$.  
  2.2. Calculate the three confidence intervals and store their bounds.  
3. Create line plots with $N$ on the x axis and with the lower and upper bounds of the confidence intervals on the y axis.  
  3.1.\* For best results, try to create a single plot that shows all three types of confidence intervals. Make sure that each type of a confidence interval gets assigned a single, distinct color. To do this, you may need to import additional modules from `plotly`.    
4. Annotate the plots with a horizontal line corresponding to the true mean.  
5. Based on the plots, can you describe how the length of the confidence intervals changes with an increasing sample size? (this question is about the *trend*, i.e. the average change)    
6. Which type of confidence intervals has the most variability, and which has the least? Why? (this question is about the variability *around the trend*, i.e. on top of the average change)   
7. Where can you observe the largest differences between the three types of the confidence intervals?  
8. Can you see how all the three types become more and more equivalent as the sample size grows? Why does that happen?  
9. Does increasing the sample size always result in the same increase in the precision of the estimation?  

**Exercise 13.** A researcher has measured the blood glucose concentration (BGC) on a well-selected sample of 1000 individuals. The researcher has determined that BGC is normally distributed and that the 95% confidence interval for the average BGC is [192 mg/dl, 203 mg/dl]. Next, the researcher has run the following further studies. Evaluate whether they are correct, and if not, try to suggest corrections. Support your claims with analytical proofs or numerical simulations.

- The researcher has tested a new, experimental methodology for measuring BGC on a well-selected sample of another 1000 individuals. The average BGC measured this way turned out to be 205 mg/dl. The researcher has concluded that the new methodology is biased.  
- The researcher has tested a randomly selected group of 100 people who visited fast-food restaurants more than 5 times in the previous month. The confidence interval for BGC turned out to be [205 mg/dl, 210 mg/dl]. The researcher has concluded that frequently eating fast food increases the blood glucose concentration.  
- The researcher has measured BGC on another set of 1000 individuals. The 95% confidence interval for this group was equal to [198 mg/dl, 208 mg/dl]. The researcher has concluded that the average BGC in the whole population is most likely within [198 mg/dl, 203 mg/dl].   