# **Applied Statistics Worksheet 3: Confidence Intervals**

***SEMT20002 - Nikolai Bode***

This worksheet focusses on the material covered in Lecture 3: quantifying the uncertainty of an estimated quantity.

The instructions for what you are expected to do are either shown in the text or given as comments in the code boxes.


## **Packages**

Whilst packages are mentioned (and used in the solutions provided), please do not simply run the code that imports them, in case some of these packages are not compatible with what you have already installed.

From the solutions, copy only the relevant lines into your own file and then execute them, checking that they work with your existing installed packages and importing new packages only when needed. For plotting results in particular, we suggest you focus on using functions you know already rather than trying to exactly replicate the plots shown in lectures or in the solutions.

For this worksheet, the functionalities required are contained within `scipy.stats`.

## **1. Z-CI and t-CI for the mean of a Normal RV**

(CI = Confidence Interval)

Consider a normally distributed RV, $ X $, such that $ X \sim N(0, 1) $.

1. Generate 10000 sets of samples from $ X $, each of size $ n $, and plot the sampling distribution of the mean for different values of the sample size $ n \in [5, 20, 100] $.
2. Use the empirical quantiles to compute the empirical CIs for the various $ n $ for a 95% confidence level ($ \alpha = 0.05 $).
3. Build the $ Z $-CI (Lecture 3 Eq.(1)) and $ t $-CI (Lecture 3 Eq.(2)) for each of the samples, using the formulae. Is the length of the empirical CIs close to that of the $ Z $-CIs? And of the $ t $-CIs?
4. How many estimated $ Z $-CIs contain the true value? (For how many estimates does the true value fall inside the $ Z $-CI?) And how many $ t $-CIs?
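
Steps 1-4 can be sketched roughly as follows. This is only one possible approach, not the official solution. It assumes that Lecture 3 Eq.(1) is the usual known-variance interval $\bar{x} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$ and that Eq.(2) is $\bar{x} \pm t_{1-\alpha/2,\,n-1}\,s/\sqrt{n}$; the variable names and the fixed seed are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible
S, alpha = 10_000, 0.05          # number of samples, significance level
mu, sigma = 0.0, 1.0             # true mean and (known) standard deviation

results = {}
for n in (5, 20, 100):
    x = rng.normal(mu, sigma, size=(S, n))   # S samples of size n, one per row
    xbar = x.mean(axis=1)                    # S sample means (sampling distribution)
    s = x.std(axis=1, ddof=1)                # S sample standard deviations

    # Empirical CI from the quantiles of the S sample means
    emp_lo, emp_hi = np.quantile(xbar, [alpha / 2, 1 - alpha / 2])

    # Half-widths: Z-CI uses the known sigma, t-CI uses the estimated s
    z_half = stats.norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)
    t_half = stats.t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)

    results[n] = {
        "emp_len": emp_hi - emp_lo,
        "z_len": 2 * z_half,
        "t_len_mean": 2 * t_half.mean(),
        # fraction of CIs (centred on each sample mean) that contain mu
        "z_cover": np.mean(np.abs(xbar - mu) <= z_half),
        "t_cover": np.mean(np.abs(xbar - mu) <= t_half),
    }
    print(n, results[n])
```

A histogram of `xbar` for each $ n $ (using whichever plotting function you already know) gives the sampling distribution asked for in step 1.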

***Solution***

## **2. CI for the mean of a Bernoulli RV & the Central Limit Theorem**

1. **Central limit theorem (CLT):** generate samples from a Bernoulli RV with $ p = 0.4 $ and plot the sampling distribution of the mean for different values of the sample size $ n $. Show that the sampling distribution approaches a Normal distribution for large $ n $.

2. Use the empirical quantiles to compute the empirical CIs for the various $ n $ for a 95% confidence level ($ \alpha = 0.05 $).

3. Build the $ Z $-CI for each sample using the formula Eq.(1) of Lecture 3. Is the mean length of the $ Z $-CIs close to that of the empirical CIs?

4. How many estimated $ Z $-CIs contain the true value? (For how many estimates does the true value fall inside the $ Z $-CI?)

5. Build the $ t $-CI for each sample using the formula Eq.(2) of Lecture 3 and compute the mean lengths and the fraction that contain the true value. What do you observe?
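
A minimal sketch of the Bernoulli experiment, under some assumptions: the sample sizes `(5, 30, 500)` are illustrative (the worksheet does not prescribe them), the $ Z $-CI uses the true standard deviation $\sqrt{p(1-p)}$ as if it were known, and the $ t $-CI uses the sample standard deviation. The skewness of the sample means is used here as a crude numerical check of the CLT in place of a plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)      # fixed seed for reproducibility
S, alpha, p = 10_000, 0.05, 0.4
sigma = np.sqrt(p * (1 - p))        # true std of Bernoulli(p), assumed known for the Z-CI

bern = {}
for n in (5, 30, 500):              # illustrative sample sizes
    x = rng.binomial(1, p, size=(S, n))   # S Bernoulli samples of size n
    phat = x.mean(axis=1)                 # sample means (estimates of p)
    s = x.std(axis=1, ddof=1)             # sample stds (0 if all values are equal)

    emp_lo, emp_hi = np.quantile(phat, [alpha / 2, 1 - alpha / 2])
    z_half = stats.norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)
    t_half = stats.t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)

    bern[n] = {
        "skew": stats.skew(phat),         # approaches 0 as the distribution becomes Normal
        "emp_len": emp_hi - emp_lo,
        "z_len": 2 * z_half,
        "t_len_mean": 2 * t_half.mean(),
        "z_cover": np.mean(np.abs(phat - p) <= z_half),
        "t_cover": np.mean(np.abs(phat - p) <= t_half),
    }
    print(n, bern[n])
```

For very small $ n $, a sample can consist of all zeros or all ones, in which case `s` is 0 and the $ t $-CI collapses to a single point; this is one reason its coverage can fall short of 95%.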

## **3. Exercises**

**Fruit Flies**

An experiment was conducted to determine the effectiveness of heat treatment to kill fruit fly eggs in mangoes. From 5903 eggs in treated mangoes, 637 adults hatched. What is the probability $p$ that an egg will survive the heat treatment?

1. Work out an estimate for $p$.
2. Work out the standard deviation (standard error) of the estimate of $p$.
3. Work out a 99% confidence interval for $p$.
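
One way to carry out the three steps numerically is sketched here. It assumes each egg survives independently with the same probability $p$ (a Bernoulli model) and uses the large-sample Normal approximation $\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

n_eggs, n_hatched = 5903, 637
alpha = 0.01                                   # 99% confidence level

p_hat = n_hatched / n_eggs                     # step 1: estimate of p
se = np.sqrt(p_hat * (1 - p_hat) / n_eggs)     # step 2: standard error of p_hat
z = stats.norm.ppf(1 - alpha / 2)              # Normal quantile for a two-sided 99% CI
ci_p = (p_hat - z * se, p_hat + z * se)        # step 3: confidence interval

print(f"p_hat = {p_hat:.4f}, se = {se:.4f}, 99% CI = ({ci_p[0]:.4f}, {ci_p[1]:.4f})")
```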

***Solution***

**Call Centre**

The call centre for a bank samples $n = 58$ incoming phone calls and records the time taken to answer each. It is found that the average call time is 99 seconds and the variance is estimated to be 576 sec$^2$. Find a 90% confidence interval for the mean call time. (List the assumptions you make.)
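
A possible calculation is sketched here, assuming the calls are independent and that the sample mean is approximately Normal, so that a $ t $-CI with $n - 1 = 57$ degrees of freedom applies because the variance is estimated. With $n = 58$ a $ Z $-CI would give a very similar interval; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

n_calls, mean_time, var_time = 58, 99.0, 576.0   # values given in the exercise
alpha = 0.10                                     # 90% confidence level

s = np.sqrt(var_time)                            # estimated standard deviation (seconds)
t_crit = stats.t.ppf(1 - alpha / 2, df=n_calls - 1)
half = t_crit * s / np.sqrt(n_calls)             # half-width of the t-CI
ci_t = (mean_time - half, mean_time + half)

print(f"90% t-CI for the mean call time: ({ci_t[0]:.2f}, {ci_t[1]:.2f}) seconds")
```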

***Solution***

## **4. When the t-CI fails: the mean of a Pareto distribution**

1. Generate samples from a Pareto distribution $ P_X(x; \theta) = \frac{\theta}{x^{\theta+1}} $ for $ x > 1 $ with $ \theta = 1.5 $ (use the code `x = np.random.rand(n, S) ** (-1 / theta)` from Lab 2; note that Python uses `**`, not `^`, for powers) and plot the sampling distribution of the mean for different values of the sample size $ n $. Is the sampling distribution symmetric for small $ n $? Does it approach a Normal distribution for large $ n $?

2. Use the empirical quantiles of the sampling distribution to compute the empirical CIs of the mean for the various $ n $ for a 95% confidence level ($ \alpha = 0.05 $). Are the CIs symmetric with respect to the true value? (Why?)

3. Build the t-CI for each sample. Is the mean length of the t-CIs close to that of the empirical CIs?

4. How many estimated t-CIs contain the true value? (For how many estimates does the true value fall inside the t-CI?)
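
The Pareto experiment can be sketched along these lines. The sample sizes, the number of samples `S`, and the seed are illustrative assumptions. For $ \theta = 1.5 $ the true mean is $ \theta/(\theta - 1) = 3 $, while the variance is infinite, which is why the $ t $-CI is expected to struggle. The sketch draws from `1 - U` instead of `U` so that the uniform variate is never exactly zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)        # fixed seed for reproducibility
S, alpha, theta = 5_000, 0.05, 1.5
true_mean = theta / (theta - 1)       # equals 3 for theta = 1.5

pareto = {}
for n in (10, 100, 1000):             # illustrative sample sizes
    u = 1.0 - rng.random((S, n))      # uniform on (0, 1], so u is never 0
    x = u ** (-1 / theta)             # inverse-transform sampling of the Pareto RV
    xbar = x.mean(axis=1)
    s = x.std(axis=1, ddof=1)

    emp_lo, emp_hi = np.quantile(xbar, [alpha / 2, 1 - alpha / 2])
    t_half = stats.t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)

    pareto[n] = {
        "skew": stats.skew(xbar),                 # positive for a right-skewed distribution
        "emp_ci": (emp_lo, emp_hi),
        "emp_len": emp_hi - emp_lo,
        "t_len_mean": 2 * t_half.mean(),
        "t_cover": np.mean(np.abs(xbar - true_mean) <= t_half),
    }
    print(n, pareto[n])
```

Comparing `emp_ci` with `true_mean` shows whether the empirical CI is symmetric about the true value, and `t_cover` can be compared directly with the nominal 95%.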

***Solution***

## **5. Bootstrapping**

1. Plot the empirical sampling distribution for the mean of a Normal RV with $ \mu = 0 $ and $ \sigma = 1 $ (see Exercise 1) by generating $ S = 1000 $ samples of size $ n = 5 $.
2. Plot the bootstrap distribution obtained by generating bootstrap samples, where each sample is generated by drawing $ n $ elements with replacement from an original sample of size $ n $.
3. Compare the sampling and bootstrap distributions: do they have the same mean? Do they have the same spread? Do your answers change if you vary $ n $?
4. Compute the bootstrap-CI for a 50% confidence level and add 3 vertical lines (one for the true value and two for the boot-CI) to the plot with the sampling and bootstrap distribution.
5. Repeat steps 1-4 a few times and count the fraction of boot-CIs that don't contain the true value: is it roughly 50%?
6. Write a for loop that repeats steps 2-4 1000 times for a confidence level of 95%: (a) do boot-CIs and empirical CIs have similar length? (b) How many boot-CIs contain the true value? What happens when $ n $ increases?
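
A minimal sketch of the loop in step 6, using the percentile bootstrap for the mean at a 95% confidence level. The helper `bootstrap_ci`, the number of bootstrap resamples `B`, the choice of $ n \in \{5, 50\} $ and the seed are all illustrative assumptions, not part of the worksheet.

```python
import numpy as np

rng = np.random.default_rng(3)   # fixed seed for reproducibility
mu, sigma = 0.0, 1.0             # true parameters of the Normal RV
B, R, alpha = 1000, 1000, 0.05   # bootstrap resamples, repetitions, significance level

def bootstrap_ci(sample, B, alpha, rng):
    """Percentile bootstrap CI for the mean of a 1-D sample (hypothetical helper)."""
    n = sample.size
    idx = rng.integers(0, n, size=(B, n))     # B resamples of n indices, with replacement
    boot_means = sample[idx].mean(axis=1)     # bootstrap distribution of the mean
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

boot = {}
for n in (5, 50):
    lengths = np.empty(R)
    covered = 0
    for r in range(R):
        sample = rng.normal(mu, sigma, size=n)        # a fresh original sample
        lo, hi = bootstrap_ci(sample, B, alpha, rng)
        lengths[r] = hi - lo
        covered += int(lo <= mu <= hi)                # does the boot-CI contain mu?

    # Empirical CI length from the sampling distribution of the mean, for comparison
    xbar = rng.normal(mu, sigma, size=(R, n)).mean(axis=1)
    emp_lo, emp_hi = np.quantile(xbar, [alpha / 2, 1 - alpha / 2])

    boot[n] = {
        "boot_len_mean": lengths.mean(),
        "emp_len": emp_hi - emp_lo,
        "cover": covered / R,
    }
    print(n, boot[n])
```

Setting `alpha = 0.5` in the same helper gives the 50% boot-CI needed for steps 4 and 5.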

***Solution***

**Bootstrap-CI for the standard deviation of a population**

Use the bootstrap percentile CI to estimate the standard deviation of student heights at a 95% confidence level.

The measured heights of random students are:

`d = [176, 165, 189, 180, 172, 169, 162, 161, 183, 170]`
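
One way to set this up is sketched here: resample the ten heights with replacement, compute the sample standard deviation of each resample, and take the 2.5% and 97.5% quantiles of those values. The number of resamples `B`, the seed, and the use of `ddof=1` (the unbiased-variance convention) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)   # fixed seed for reproducibility
d = np.array([176, 165, 189, 180, 172, 169, 162, 161, 183, 170])
B, alpha = 10_000, 0.05          # bootstrap resamples, significance level

s_hat = d.std(ddof=1)                             # point estimate of the standard deviation
idx = rng.integers(0, d.size, size=(B, d.size))   # resample indices with replacement
boot_sd = d[idx].std(axis=1, ddof=1)              # bootstrap distribution of the std
ci_sd = np.quantile(boot_sd, [alpha / 2, 1 - alpha / 2])

print(f"s_hat = {s_hat:.2f}, 95% bootstrap percentile CI = ({ci_sd[0]:.2f}, {ci_sd[1]:.2f})")
```

With only ten observations the bootstrap distribution of the standard deviation is fairly coarse, so the interval should be read as a rough indication.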

***Solution***