# Week 7 - Distributions of Sampling Statistics

This is a Jupyter notebook to explore the material in (Ross, 2017, Chp. 7). 



In [1]:
%matplotlib inline
# From now on we'll start each notebook with the library imports
# and special commands, to keep these things in one place (which
# is good practice). The %matplotlib line above is a Jupyter magic
# command that makes matplotlib plot inline (between cells).
# Next we import the libraries and give them short names.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
from collections import Counter
from collections import defaultdict

## Exercise A

Complete question 3 from problems for (Ross, 2017, Sec. 7.3) -- the text is repeated below for convenience:

> 3. Consider a population whose probabilities are given by
>
> $$p(1) = p(2) = p(3) = \frac{1}{3}$$
>
>    (a) Determine E[X].
>
>    (b) Determine SD(X).
>
>    (c) Let $\bar{X}$ denote the sample mean of a sample of size 2 from this
population. Determine the possible values of $\bar{X}$ along with their
probabilities.
>
>    (d) Use the result of part (c) to compute $E[\bar{X}]$ and $SD(\bar{X})$.
>
>    (e) Are your answers consistent?


*complete your answers in Markdown*

>    (a) Determine E[X].

$$E[X] = \frac{1}{3}\cdot 1 + \frac{1}{3}\cdot 2 + \frac{1}{3}\cdot 3 = \frac{6}{3} = 2$$

>    (b) Determine SD(X).

$$\begin{aligned}
var(X)
& = E[X^2] - E[X]^2 \\
& = \left(\frac{1}{3}\cdot 1 + \frac{1}{3}\cdot 4 + \frac{1}{3}\cdot 9\right) - 2^2\\
& = \frac{14}{3} - 4 = \frac{2}{3}
\end{aligned}
$$

$$SD(X) = \sqrt{var(X)} = \sqrt{\frac{2}{3}} = 0.816$$
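
Parts (a) and (b) can also be checked numerically. The following is a minimal sketch, assuming the population is stored as two NumPy arrays; the names `values` and `probs` are chosen here for illustration only.

```python
import numpy as np

# Population pmf from the question: p(1) = p(2) = p(3) = 1/3
values = np.array([1, 2, 3])
probs = np.full(3, 1 / 3)

mean_x = np.sum(values * probs)                 # E[X]
var_x = np.sum(values**2 * probs) - mean_x**2   # E[X^2] - E[X]^2
sd_x = np.sqrt(var_x)                           # SD(X)

print(f"E[X] = {mean_x:.3f}, var(X) = {var_x:.3f}, SD(X) = {sd_x:.3f}")
```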

>    (c) Let $\bar{X}$ denote the sample mean of a sample of size 2 from this
population. Determine the possible values of $\bar{X}$ along with their
probabilities.

To calculate this, we consider all the possible outcomes of two draws, $X_1$ and $X_2$, from the original distribution. Because every outcome $(X_1, X_2)$ is equally likely, for each possible mean $\bar{X} = (X_1 + X_2)/2$ it is sufficient to count the number of outcomes that give that mean and divide by the total number of outcomes. Alternatively, you can sum the probabilities of the independent outcomes that give each mean.

All possible values of the mean $\bar{X}$ with associated probabilities are given in the table:

| $\bar{X}$   | prob |
| :---: | :-------: |
| 1     | $\frac{1}{9}$   |
| 1.5     | $\frac{2}{9}$   |
| 2     | $\frac{3}{9}$   |
| 2.5     | $\frac{2}{9}$   |
| 3     | $\frac{1}{9}$   |
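
The counting argument can be reproduced by brute-force enumeration. This is a small sketch, assuming the two draws are independent (as in the worked answer); it uses exact fractions from the standard library, and the names `dist` and `p_each` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

values = [1, 2, 3]
p_each = Fraction(1, 3)  # probability of each population value

# Every ordered pair (x1, x2) has probability 1/3 * 1/3 = 1/9;
# accumulate that probability under the pair's sample mean.
dist = Counter()
for x1, x2 in product(values, repeat=2):
    dist[Fraction(x1 + x2, 2)] += p_each * p_each

for xbar in sorted(dist):
    print(f"Xbar = {float(xbar):.1f}  prob = {dist[xbar]}")
```

`Fraction` prints in lowest terms, so the probability $\frac{3}{9}$ for $\bar{X} = 2$ appears as `1/3`.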


>    (d) Use the result of part (c) to compute $E[\bar{X}]$ and $SD(\bar{X})$.

$$E[\bar{X}] = \left(\frac{1}{9}\cdot 1
  + \frac{2}{9}\cdot 1.5
  + \frac{3}{9}\cdot 2
  + \frac{2}{9}\cdot 2.5
  + \frac{1}{9}\cdot 3\right) =\frac{18}{9} = 2$$
  
$$E[\bar{X}^2] = \left(\frac{1}{9}\cdot 1
  + \frac{2}{9}\cdot1.5^2
  + \frac{3}{9}\cdot 2^2
  + \frac{2}{9}\cdot 2.5^2
  + \frac{1}{9}\cdot 3^2\right) =\frac{39}{9} = \frac{13}{3}$$
  
$$SD(\bar{X}) = \sqrt{var(\bar{X})} = \sqrt{E[\bar{X}^2] - E[\bar{X}]^2} = \sqrt{\frac{13}{3}-2^2} = \sqrt{\frac{1}{3}} = 0.577$$
  

>    (e) Are your answers consistent?

Using the equation for the expected value of the mean of a sample of size $n$, we have:

$$E[\bar{X}] = E[X]$$

This holds, since parts (a) and (d) both give an expected value of $2$.

Using the equation for the SD of the mean of a sample of size $n$, we have:

$$SD(\bar{X}) = SD(X)/\sqrt{n}$$

In this case:

$$SD(X)/\sqrt{n} = 0.816/\sqrt{2} = 0.577 = SD(\bar{X})$$ (as required)
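
The same consistency check can be done by simulation. The sketch below is illustrative only: the seed and the number of simulated samples are arbitrary choices, and the results are estimates that vary slightly with the seed.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability
n = 2                                # sample size from part (c)
num_samples = 200_000                # number of simulated samples (arbitrary)

# Draw num_samples samples of size n from {1, 2, 3} with equal probability
samples = rng.choice([1, 2, 3], size=(num_samples, n))
sample_means = samples.mean(axis=1)

print(f"simulated E[Xbar]  = {sample_means.mean():.3f}")
print(f"simulated SD(Xbar) = {sample_means.std():.3f}")
print(f"SD(X)/sqrt(n)      = {np.sqrt(2 / 3) / np.sqrt(n):.3f}")
```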

## Exercise B

Complete question 5 from problems for (Ross, 2017, Sec. 7.4) -- the text is repeated below for convenience:

> 5. The time it takes to develop a photographic print is a random variable
with mean 17 seconds and standard deviation 0.8 seconds. Approximate
the probability that the total amount of time that it takes to process 100
prints is
>
>    (a) More than 1720 seconds
>
>    (b) Between 1690 and 1710 seconds

*Write up in markdown but you may want to use the code block below to complete your calculations.*

>    (a) More than 1720 seconds

The time for the $i$th photographic print to develop is $X_i$, with $E[X_i] = 17$ seconds and $SD(X_i) = 0.8$ seconds. By the central limit theorem, the total $X= \sum_{i=1}^{100} X_i$ is approximately normal with $E[X] = 100 \cdot 17 = 1700$ seconds and $SD(X) = 0.8\sqrt{100} = 8$ seconds.

$$
\begin{aligned}
\Pr(X > 1720)
&= \Pr\left(\frac{X-E[X]}{SD(X)} > \frac{1720-E[X]}{SD(X)}\right)\\
&= \Pr\left(Z > \frac{1720-1700}{8}\right)\\
&= \Pr\left(Z > \frac{20}{8}\right) \\
&= \Pr(Z > 2.5) = 1 - 0.9938 = 0.0062
\end{aligned}$$



>    (b) Between 1690 and 1710 seconds

$$
\begin{aligned}
\Pr( 1690 < X < 1710)
&= \Pr\left( - \frac{5}{4} < Z < \frac{5}{4}\right)\\
&= \Pr\left(Z < 1.25\right) - \Pr\left(Z < -1.25\right) \\
&= 0.8944 - (1 - 0.8944) = 0.7888
\end{aligned}$$
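
As a rough sanity check on the normal approximation, the total time can be simulated. The question gives only the mean and standard deviation of a single print time, so the sketch below has to assume a distribution: it uses a uniform distribution with the same mean and SD. That choice is an assumption made here for illustration, not part of the problem; the seed and the number of batches are also arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for repeatability
mu, sigma, n_prints = 17.0, 0.8, 100
n_batches = 50_000                   # number of simulated batches (arbitrary)

# ASSUMPTION: print times are uniform on [mu - h, mu + h].
# A uniform distribution with half-width h has SD h/sqrt(3), so h = sigma*sqrt(3).
half_width = sigma * np.sqrt(3)
times = rng.uniform(mu - half_width, mu + half_width, size=(n_batches, n_prints))
totals = times.sum(axis=1)           # total time for each batch of 100 prints

p_a = np.mean(totals > 1720)
p_b = np.mean((totals > 1690) & (totals < 1710))
print(f"simulated Pr(X > 1720)        = {p_a:.4f}")
print(f"simulated Pr(1690 < X < 1710) = {p_b:.4f}")
```

The simulated proportions should land close to the values obtained from the normal table, up to sampling noise.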



In [43]:
## supporting code for Exercise B

# part a
print("Part (a)")
E_X = 17*100
print(f"E[X] = {E_X}")
sd_X = 0.8*np.sqrt(100)
print(f"SD(X) = {sd_X}")
za = (1720 - E_X)/sd_X
print(f"za = {za}")
# from table D.1
print(f"Pr(Z > {za}) = {1 - 0.9938:.4f}")
# using scipy: rv is the standard normal distribution
# (it must be created before rv.cdf is called)
rv = stats.norm()
print(f"Pr(Z > {za}) = {1 - rv.cdf(za):.4f}")


print("Part (b)")
# from table D.1
print(f"Pr(-1.25 < Z < 1.25) = {0.8944 - (1 - 0.8944):.4f}")
# using scipy
print(f"Pr(-1.25 < Z < 1.25) = {rv.cdf(1.25)- rv.cdf(-1.25):.4f}")

Part (a)
E[X] = 1700
SD(X) = 8.0
za = 2.5
Pr(Z > 2.5) = 0.0062
Pr(Z > 2.5) = 0.0062
Part (b)
Pr(-1.25 < Z < 1.25) = 0.7888
Pr(-1.25 < Z < 1.25) = 0.7887


## Exercise C

Complete questions 2 and 3 from problems for (Ross, 2017, Sec. 7.5) -- the text is repeated below for convenience:

> 2. Ten percent of all electrical batteries are defective. In a random selection of
8 of these batteries, find the probability that
>
>    (a) There are no defective batteries.
>
>    (b) More than 15 percent of the batteries are defective.
>
>    (c) Between 8 and 12 percent of the batteries are defective.

> 3. Suppose there was a random selection of n = 50 batteries in Prob. 2. Determine approximate probabilities for parts (a), (b), and (c) of that problem.

*complete in markdown but you can use the code block below for any calculations*


> 2. (a) There are no defective batteries.

We are choosing 8 batteries, each having an independent probability of a defect of $0.1$. Therefore the number of defective batteries in the sample, $X$, has a binomial distribution with $n = 8$ and $p = 0.1$. The probability that $X$ takes value $0$ is:

$$\Pr(X = 0) = \left(\begin{array}{c}8 \\ 0\end{array}\right) 0.1^0 (1-0.1)^8 = 0.9^8 = 0.4305$$

> 2. (b) More than 15 percent of the batteries are defective.

As the sample is so small, $15\%$ needs to be translated into a meaningful number of batteries. $15\%$ of 8 is $1.2$, so $2$ or more batteries in the sample must be defective for this to be true.

$$\Pr(X \geq 2) = 1 - \Pr(X=0) - \Pr(X=1)$$

$$\Pr(X=1) = \left(\begin{array}{c}8 \\ 1\end{array}\right) 0.1^1 (1-0.1)^7 = 8 \cdot 0.1^1 \cdot 0.9^7 = 0.3826$$

And so,
$$\Pr(X \geq 2) = 1 - 0.4305 - 0.3826 = 0.1869$$

> 2. (c) Between 8 and 12 percent of the batteries are defective.

As there is a sample of 8 batteries, we must interpret these proportions in terms of that. $8\%$ of 8 is $0.64$ and $12\%$ of 8 is $0.96$. As we cannot have a fractional number of defective batteries, and no whole number lies between $0.64$ and $0.96$, we will never have between $8\%$ and $12\%$ of the sample defective, and so:

$$\Pr(\text{Between $8\%$ and $12\%$ of sample is defective}) = \Pr(0.64 < X < 0.96) = 0$$
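
The three parts of question 2 can be cross-checked against the exact binomial distribution in `scipy.stats`. This is a minimal sketch; the variable names are illustrative.

```python
from scipy import stats

n, p = 8, 0.1
X = stats.binom(n, p)   # number of defective batteries in the sample

p_none = X.pmf(0)       # part (a): Pr(X = 0)
p_more_15 = X.sf(1)     # part (b): Pr(X > 1), i.e. Pr(X >= 2)

# part (c): whole-number counts k lying strictly between 8% and 12% of n
counts = [k for k in range(n + 1) if 0.08 * n < k < 0.12 * n]
p_between = float(sum(X.pmf(k) for k in counts))

print(f"Pr(X = 0)  = {p_none:.4f}")
print(f"Pr(X >= 2) = {p_more_15:.4f}")
print(f"counts between 8% and 12% of {n}: {counts}, probability = {p_between}")
```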

> 3. (a) There are no defective batteries.

We are now choosing 50 batteries, each having an independent probability of a defect of $0.1$. Therefore the number of defective batteries in the sample, $X$, has a binomial distribution with $n = 50$ and $p = 0.1$. The probability that $X$ takes value $0$ is:

$$\Pr(X = 0) = \left(\begin{array}{c}50 \\ 0\end{array}\right) 0.1^0 (1-0.1)^{50} = 0.0052$$

> 3. (b) More than 15 percent of the batteries are defective.

$15\%$ of 50 is $7.5$, so 8 or more batteries must be defective for this to be true. As the sample is large enough, we can use the normal approximation to estimate the chance of this.

The mean of our sum random variable, $X=X_1 + \ldots + X_{50}$, is $E[X] = \mu = np = 50\cdot 0.1 = 5$, and the standard deviation is $SD(X) = \sigma = \sqrt{np(1-p)} = \sqrt{50\cdot0.1\cdot 0.9} = \sqrt{4.5} = 2.12$. Therefore

$$\Pr(\text{More than $15\%$ of sample defective}) = \Pr(X \geq 8) = \Pr\left(Z \geq \frac{8-\mu}{\sigma}\right) = \Pr(Z \geq 1.41) = 1 - 0.9207 = 0.0793$$


> 3. (c) Between 8 and 12 percent of the batteries are defective.

$8\%$ of 50 is $4$ and $12\%$ of 50 is $6$. These are whole numbers, so we can use them directly:

$$\Pr(\text{Between $8\%$ and $12\%$ of sample is defective}) = \Pr(4 \leq X \leq 6)= \Pr(X \leq 6) - \Pr(X \leq 4)$$

$$\Pr(X \leq 6) = \Pr\left(Z \leq \frac{6-\mu}{\sigma}\right) = \Pr(Z \leq 0.47) = 0.6808$$
$$\Pr(X \leq 4) = \Pr\left(Z \leq \frac{4-\mu}{\sigma}\right) = \Pr(Z \leq -0.47) = 1-0.6808$$

$$\Pr(4 \leq X \leq 6) = 0.6808 - (1-0.6808) = 0.3616$$
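
To see how good these approximations are, they can be compared with the exact binomial probabilities. The sketch below also computes a continuity-corrected normal approximation, which shifts each integer boundary by $0.5$. That correction is not used in the worked answers above; it is included only as an optional refinement, and all variable names are illustrative.

```python
import numpy as np
from scipy import stats

n, p = 50, 0.1
mu, sigma = n * p, np.sqrt(n * p * (1 - p))
X = stats.binom(n, p)   # exact distribution of the number of defectives
Z = stats.norm()        # standard normal

# part (b): Pr(X >= 8)
exact_b = X.sf(7)                          # Pr(X > 7)
approx_b = Z.sf((8 - mu) / sigma)          # plain normal approximation
approx_b_cc = Z.sf((7.5 - mu) / sigma)     # with continuity correction

# part (c): Pr(4 <= X <= 6)
exact_c = X.cdf(6) - X.cdf(3)
approx_c = Z.cdf((6 - mu) / sigma) - Z.cdf((4 - mu) / sigma)
approx_c_cc = Z.cdf((6.5 - mu) / sigma) - Z.cdf((3.5 - mu) / sigma)

print(f"(b) exact {exact_b:.4f}  normal {approx_b:.4f}  corrected {approx_b_cc:.4f}")
print(f"(c) exact {exact_c:.4f}  normal {approx_c:.4f}  corrected {approx_c_cc:.4f}")
```

The plain normal values differ slightly from the table-based answers because the table rounds $z$ to two decimal places.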


In [79]:
## supporting code for exercise C
p = 0.1

# question 2
print("Question 2:")
p0 = 0.9**8
print(f"Pr(X=0) = {p0:.4f}")

print(f"15% of 8 is {0.15*8}  - not a round number")
xlower = int(np.ceil(0.15*8))
print(f"For more than 15% of 8 batteries to be defective then {xlower} or more batteries in the sample must be defective")
p1 = 8 * 0.1 * 0.9**7
print(f"Pr(X=1) = {p1:.4f}")
print(f"Pr(X >= 2) = 1 - Pr(X=0) - Pr(X=1) = {1-p0-p1:.4f}")
print()

# question 3
print("Question 3:")
q0 = 0.9**50
print(f"Pr(X=0) = {q0:.4f}")

n = 50
threshold = n*0.15
minnumber = int(np.ceil(threshold))
print(f"15% of 50 is {threshold} so {minnumber} or more must be defective.")
E_X = n*p
SD_X = np.sqrt(n*p*(1-p))
print(f"E[X] = {E_X}")
print(f"SD(X) = {SD_X:.2f}")
za = (minnumber-E_X)/SD_X
print(f"Pr(More than 15% of sample defective) = Pr(X >= {minnumber}) = Pr(Z >= ({minnumber}-{E_X})/{SD_X:.2f})")
print(f"                                      = Pr(Z >= {za:.2f}) = 1-{0.9207} = {1-0.9207:.4f}")

#
thresholdc1 = 50*0.08
thresholdc2 = 50*0.12
print(f"Pr(Between 8% and 12% of sample defective) = Pr({thresholdc1} <= X <= {thresholdc2})")
zc1 = (thresholdc1-E_X)/SD_X
zc2 = (thresholdc2-E_X)/SD_X
print(f"                                           = Pr({zc1:.2f} <= Z <= {zc2:.2f})")
print(f"                                           = {0.6808} - (1-{0.6808})= {0.6808 - (1-0.6808):.4f}")



Question 2:
Pr(X=0) = 0.4305
15% of 8 is 1.2  - not a round number
For more than 15% of 8 batteries to be defective then 2 or more batteries in the sample must be defective
Pr(X=1) = 0.3826
Pr(X >= 2) = 1 - Pr(X=0) - Pr(X=1) = 0.1869

Question 3:
Pr(X=0) = 0.0052
15% of 50 is 7.5 so 8 or more must be defective.
E[X] = 5.0
SD(X) = 2.12
Pr(More than 15% of sample defective) = Pr(X >= 8) = Pr(Z >= (8-5.0)/2.12)
                                      = Pr(Z >= 1.41) = 1-0.9207 = 0.0793
Pr(Between 8% and 12% of sample defective) = Pr(4.0 <= X <= 6.0)
                                           = Pr(-0.47 <= Z <= 0.47)
                                           = 0.6808 - (1-0.6808)= 0.3616


## Exercise D

Complete question 1 from problems for (Ross, 2017, Sec. 7.6) -- the text is repeated below for convenience:

> 1. The following data sets come from normal populations whose standard
deviation σ is specified. In each case, determine the value of a statistic
whose distribution is chi-squared, and tell how many degrees of freedom
this distribution has.
>
>    (a) 104, 110, 100, 98, 106; σ = 4
>
>    (b) 1.2, 1.6, 2.0, 1.5, 1.3, 1.8; σ = 0.5
>
>    (c) 12.4, 14.0, 16.0; σ = 2.4

In [88]:
## complete in python

print("Part (a)")
dataa = np.array([ 104, 110, 100, 98, 106])
sigmaa = 4
na = dataa.size
Xbara = np.mean(dataa)
print(f"Sample mean: Xbar = {Xbara}")
S2a = np.sum((dataa-Xbara)**2)/(na-1)
Chi2a = np.sum((dataa-Xbara)**2)/sigmaa**2
print(f"S^2 = {S2a}")
print(f"Chi^2 = {Chi2a} with {na-1} degrees of freedom")
print()

print("Part (b)")
datab = np.array([ 1.2, 1.6, 2.0, 1.5, 1.3, 1.8])
sigmab = 0.5
nb = datab.size
Xbarb = np.mean(datab)
print(f"Sample mean: Xbar = {Xbarb}")
S2b = np.sum((datab-Xbarb)**2)/(nb-1)
Chi2b = np.sum((datab-Xbarb)**2)/sigmab**2
print(f"S^2 = {S2b}")
print(f"Chi^2 = {Chi2b} with {nb-1} degrees of freedom")
print()

print("Part (c)")
datac = np.array([ 12.4, 14.0, 16.0])
sigmac = 2.4
nc = datac.size
Xbarc = np.mean(datac)
print(f"Sample mean: Xbar = {Xbarc}")
S2c = np.sum((datac-Xbarc)**2)/(nc-1)
Chi2c = np.sum((datac-Xbarc)**2)/sigmac**2
print(f"S^2 = {S2c}")
print(f"Chi^2 = {Chi2c:.2f} with {nc-1} degrees of freedom")

Part (a)
Sample mean: Xbar = 103.6
S^2 = 22.8
Chi^2 = 5.7 with 4 degrees of freedom

Part (b)
Sample mean: Xbar = 1.5666666666666667
S^2 = 0.09066666666666666
Chi^2 = 1.8133333333333332 with 5 degrees of freedom

Part (c)
Sample mean: Xbar = 14.133333333333333
S^2 = 3.2533333333333325
Chi^2 = 1.13 with 2 degrees of freedom
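
The three parts above repeat the same calculation of $(n-1)S^2/\sigma^2$, which has a chi-squared distribution with $n-1$ degrees of freedom when the data come from a normal population with standard deviation $\sigma$. As a sketch, the repetition could be folded into a helper; `chi2_statistic` is a hypothetical name introduced here, not something defined elsewhere in the notebook.

```python
import numpy as np

def chi2_statistic(data, sigma):
    """Return ((n-1)*S^2/sigma^2, n-1) for a sample with known population SD sigma."""
    data = np.asarray(data, dtype=float)
    n = data.size
    s2 = np.var(data, ddof=1)  # sample variance S^2 (divides by n-1)
    return (n - 1) * s2 / sigma**2, n - 1

# Part (a) data, as used above
stat, dof = chi2_statistic([104, 110, 100, 98, 106], sigma=4)
print(f"Chi^2 = {stat:.2f} with {dof} degrees of freedom")
```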
