## Central Limit Theorem, Setting up of confidence intervals, Setting up hypothesis 

In [6]:
import  scipy.stats                     as  stats
import  statsmodels.stats.proportion    as  SMP 
import  numpy                           as  np
import  pandas                          as  pd

### Central Limit Theorem

* The Central Limit Theorem (CLT) is one of the most important theorems in statistics because of its applications in hypothesis testing.

* The CLT states that for a large sample of size n drawn from a population with mean $\mu$ and standard deviation $\sigma$, the sampling distribution of the sample mean is approximately normal with mean μ and standard error σ / √(n), irrespective of the distribution of the population.

* Let S1, S2, ..., Sk be samples of size n whose observations are independent and identically distributed draws from a population with mean μ and standard deviation σ. 

* Let $\overline{X_1}$, $\overline{X_2}$, ..., $\overline{X_k}$ be the sample means of the samples (S1, S2, ..., Sk). 

* According to the CLT, the distribution of $\overline{X_1}$, $\overline{X_2}$, ..., $\overline{X_k}$ is approximately normal with mean μ and standard deviation σ / √(n) for large values of n. 

* As a general rule, statisticians have found that for many population distributions, when the sample size is at least 30, the sampling distribution of the mean is approximately **normal.**
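
The statement above can be checked with a small simulation. The sketch below is illustrative only: the exponential population, the sample size n = 50 and the number of samples k = 10,000 are assumed values, not part of the original notebook. It draws k samples from a skewed population and compares the mean and standard deviation of the sample means with μ and σ / √(n).

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical skewed population: exponential with mean 2 (its sigma is also 2)
mu, sigma = 2.0, 2.0
n, k      = 50, 10_000      # sample size and number of samples (assumed values)

# Draw k samples of size n and compute one mean per sample
samples      = rng.exponential(scale=mu, size=(k, n))
sample_means = samples.mean(axis=1)

print('Mean of sample means : %1.4f (mu = %1.4f)' % (sample_means.mean(), mu))
print('SD of sample means   : %1.4f (sigma / sqrt(n) = %1.4f)'
      % (sample_means.std(ddof=1), sigma / np.sqrt(n)))
```

A histogram of `sample_means` should look roughly bell-shaped even though the population itself is strongly skewed.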

### Implications of CLT
* 1) ($\overline{X}$ - μ) / (σ / √(n)) ~ N(0,1), approximately, for large n
* 2) If Sn = X1 + X2 + ... + Xn, then E(Sn) = nμ and the standard error of Sn is σ √(n)
* The random variable (Sn - nμ) / (σ √(n)) is approximately a standard normal variate


### Example 1: 

A hospital is interested in estimating the average time it takes to discharge a patient after the doctor signs the discharge summary sheet. 

Calculate the required sample size at a confidence level of 95%. Assume that the population standard deviation is 25 minutes.

### Solution:


From the CLT (Central Limit theorem), we know that the sampling distribution of the mean follows a normal distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.  

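**Standard Normal Variate Z = $\frac{\overline{X} - \mu} {\frac{\sigma}{\sqrt{n}}}$**

So, $\sqrt{n} = \frac{Z \times \sigma} {\overline{X} - \mu}$

* Let Diff = $\overline{X} - \mu$ be the difference between the sample mean and the population mean that we are willing to tolerate (the margin of error)
* Diff = 5 minutes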
* $\sigma = 25 $
* $\alpha = 0.05$
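
As a worked check of the formula, squaring both sides of the expression for $\sqrt{n}$ and substituting the values listed above (taking Z ≈ 1.96 for $\alpha$ = 0.05) gives:

$$ n = \left( \frac{Z \times \sigma}{\text{Diff}} \right)^2 = \left( \frac{1.96 \times 25}{5} \right)^2 = 9.8^2 = 96.04 $$

The result is then rounded to a whole number of patients.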

In [3]:
# Let n be the sample size

In [2]:
import  scipy.stats as stats
import  numpy       as np

In [None]:
Z     = round(stats.norm.isf(q = 0.025),2) # We need to calculate the value of Z at (alpha/2) when alpha = 0.05
sigma = 25
D     = 5
Z

In [6]:
print(np.abs(Z))

1.96


In [7]:
n = np.ceil(((np.abs(Z) * sigma ) / D)**2) # (1.96 * 25 / 5)**2 = 96.04; round up so the margin of error is not exceeded

In [8]:
print('The required sample size at a confidence level of %d%s is %d' % (95,'%',n))

The required sample size at a confidence level of 95% is 97


## Practice Exercise 1

#### Set the confidence level to be 90% in the above example 1 and calculate the required sample size

## Confidence Intervals    

* When there is uncertainty in measuring the value of an important population parameter, it is better to find the range in which the value of the parameter is likely to lie rather than relying on a point estimate (single value).
* A confidence interval is the range in which the value of a population parameter is likely to lie with a certain probability.
* A confidence interval provides additional information about the population parameter that will be useful in decision making.

### Confidence interval for population mean

Let $X_1$, $X_2$, $X_3$, ..., $X_k$ be the sample means of samples $S_1$, $S_2$, $S_3$, ..., $S_k$, each of size n, drawn independently from a population with mean $\mu$ and standard deviation $\sigma$.

From the Central Limit Theorem, we know that the sample means $X_i$ follow an approximately normal distribution with mean $\mu$ and standard deviation $\frac{\sigma} {\sqrt{n}}$.

The variable Z = $\frac{X_i - \mu}{\frac{\sigma} {\sqrt{n}}}$ follows a standard normal distribution.

### Assume that we want to find (1 - $\alpha$) 100% confidence interval for the population mean. 
* We can distribute $\alpha$ (probability of not observing true population parameter mean in the interval) equally ($\alpha/2$) on either side of the distribution shown.

* For $\alpha$ = 0.05 or $\alpha/2$ = 0.025, that is 95% confidence interval, we can calculate lower and upper values of the confidence interval from the standard normal distribution.
* scipy.stats.norm.isf(q = 0.025) gives the value of Z for which the area under the standard normal curve to its right (the upper tail) is 0.025.
* The corresponding value is approximately 1.96 as shown in the previous example.
* Using the transformation relationship between the standard normal random variable Z and the normal random variable X, we can write the 95% confidence interval for the population mean when the population standard deviation ($\sigma$) is known as:

$\overline{X} \pm 1.96 \frac {\sigma} {\sqrt{n}}$, where $\overline{X}$ is the estimated value of the mean from a sample of size n.

#### In general, the (1 - $\alpha$) 100% confidence interval for the population mean when the population standard deviation is known can be written as 

$\overline{X} \pm Z_{\frac{\alpha}{2}} \frac {\sigma} {\sqrt{n}}$

This equation is valid for large sample sizes, irrespective of the distribution of the population.

This is equivalent to

P($\overline{X} - Z_\frac{\alpha}{2} \times \frac{\sigma}{\sqrt{n}} \leq \mu \leq \overline{X} + Z_\frac{\alpha}{2} \times \frac{\sigma}{\sqrt{n}}$) = 1 - $\alpha$
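
One way to read this probability statement is as a long-run coverage rate: if the sampling is repeated many times, about (1 - $\alpha$) 100% of the intervals should contain $\mu$. The sketch below illustrates this under assumed values (a normal population with $\mu$ = 10 and $\sigma$ = 3, n = 40 and 5,000 repetitions); none of these numbers come from the original notebook.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=1)

# Hypothetical normal population with known sigma
mu, sigma = 10.0, 3.0
n, trials = 40, 5_000
alpha     = 0.05
z         = stats.norm.isf(alpha / 2)      # approximately 1.96

# One sample mean per trial; the half-width is the same for every trial
xbar       = rng.normal(loc=mu, scale=sigma, size=(trials, n)).mean(axis=1)
half_width = z * sigma / np.sqrt(n)

# Fraction of intervals [xbar - half_width, xbar + half_width] that contain mu
coverage = np.mean(np.abs(xbar - mu) <= half_width)
print('Observed coverage: %1.3f (nominal %1.2f)' % (coverage, 1 - alpha))
```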

The absolute values of $Z_\frac{\alpha}{2}$ are shown in the following table:

In [7]:
# Find the Z value when the confidence level is 95% (alpha = 0.05)
import numpy as np

Significance_Level =  0.05

SL_2    =  Significance_Level / 2
Z       =  np.abs(round(stats.norm.isf(q = SL_2),2))
print(Z)

1.96


In [9]:
Significance_Level =  [0.1, 0.05, 0.02, 0.01] 

# DataFrame.append was removed in pandas 2.0, so collect the rows first
rows = []
for alpha in Significance_Level:
    SL_2 = alpha / 2
    Z    = np.abs(round(stats.norm.isf(q = SL_2),2))
    rows.append({"alpha": alpha, "Z_alpha_by_2": Z})
df = pd.DataFrame(rows)
print(df)

   alpha  Z_alpha_by_2
0   0.10          1.64
1   0.05          1.96
2   0.02          2.33
3   0.01          2.58


| $\alpha$ | $Z_{\frac{\alpha}{2}}$ | Confidence Interval for $\mu$ when $\sigma$ is known |
| ---- | ---- | -------------------------------------------------------- |
| 0.1  | 1.64 | $\overline{X} \pm 1.64 \frac {\sigma} {\sqrt{n}}$ |
| 0.05 | 1.96 | $\overline{X} \pm 1.96 \frac {\sigma} {\sqrt{n}}$ |
| 0.02 | 2.33 | $\overline{X} \pm 2.33 \frac {\sigma} {\sqrt{n}}$ |
| 0.01 | 2.58 | $\overline{X} \pm 2.58 \frac {\sigma} {\sqrt{n}}$ |

### Example 2

A sample of 100 diabetic patients was chosen to estimate the length of stay at a local hospital. 
The sample mean was 4.5 days and the population standard deviation was known to be 1.2 days.

* a) Calculate the 95% confidence interval for the population mean.
* b) What is the probability that the population mean is greater than 4.73 days?

### Solution

a) Calculate the 95% confidence interval for the population mean.

It is known that 
* $\overline{X}$ = 4.5
* $\sigma$       = 1.2
* n              = 100

We need to compute $\overline{X} \pm 1.96 \frac {\sigma} {\sqrt{n}}$ 

In [10]:
Xavg  = 4.5 
sigma = 1.2
n     = 100
Lower_Interval = Xavg - (1.96 * (sigma / np.sqrt(n)))
Upper_Interval = Xavg + (1.96 * (sigma / np.sqrt(n)))

print('95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', Lower_Interval , Upper_Interval))

95 % confidence interval for population mean is 4.2648  to 4.7352


### Another method of calculating Confidence Interval using  stats.norm.interval()

In [11]:
ci = 0.95
s  = sigma / np.sqrt(n)
LCI, UCI = stats.norm.interval(ci, loc = Xavg, scale = s) # Give confidence interval 95%, mean and std as arguments to get CI
print('95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', LCI , UCI))

95 % confidence interval for population mean is 4.2648  to 4.7352


b) What is the probability that the population mean is greater than 4.73 days?

We need to do the following:
* a. Calculate the Z value corresponding to 4.73 by subtracting Xavg and dividing by s
* b. Find the probability corresponding to the Z value using scipy.stats.norm.cdf and subtract it from 1: cdf gives the cumulative probability up to the Z value, whereas we are interested in the probability that the population mean is greater than 4.73 days

In [12]:
Z = (4.73 - Xavg) / s

In [13]:
P = 1- stats.norm.cdf(Z)

In [14]:
print('b. Probability that the population mean is greater than 4.73 days %1.4f' % P)

b. Probability that the population mean is greater than 4.73 days 0.0276


### Example 3

Hindustan Pencils Pvt. Ltd. is an Indian manufacturer of pencils, writing materials and other stationery items, established in 1958 in Mumbai. Nataraj brand of pencils manufactured by the company is expected to have a mean length of 172 mm and the standard deviation of the length is 0.02 mm.

To ensure quality, a sample is selected at periodic intervals to determine whether the length is still 172 mm and other dimensions of the pencil meet the quality standards set by the company.

You select a random sample of 100 pencils and the mean is 170 mm. 

Construct a 95% confidence interval for the pencil length.

### Solution

It is known that 
* $\overline{X}$ = 170 mm (the sample mean; 172 mm is the expected length, not the observed mean)
* $\sigma$       = 0.02 mm
* n              = 100

We need to compute $\overline{X} \pm 1.96 \frac {\sigma} {\sqrt{n}}$ 

In [15]:
ci     = 0.95
Xavg   = 170    # sample mean from the problem statement
sigma  = 0.02
n      = 100
s      = sigma / np.sqrt(n)
LCI, UCI = stats.norm.interval(ci, loc = Xavg, scale = s) # Give confidence level 95%, mean and standard error as arguments to get CI
print('95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', LCI , UCI))

95 % confidence interval for population mean is 169.9961  to 170.0039


## Practice Exercise 2

Construct a 99% confidence interval for the following examples given above:
* a. Example 2
* b. Example 3

### Confidence interval for population proportion

The Central Limit Theorem for the population proportion is stated below:

If $X_1$, $X_2$, ..., $X_n$ are from Bernoulli trials with probability of success p, that is E($X_i$) = p and Var($X_i$) = pq (where q = 1 - p), 
* then the sampling distribution of the sample proportion of successes (say $\hat{p}$) for a large sample size is approximately normal with mean p and standard error $\sqrt{\frac{pq}{n}}$, where n is the sample size.


The variable $\frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}}$ converges to a standard normal distribution.

Note:

* The standard deviation of the sampling distribution of proportions depends on the value of p, which is unknown, so it is estimated by replacing p and q with $\hat{p}$ and $\hat{q}$ = 1 - $\hat{p}$.
* Rule of thumb: the normal approximation is used when n is large enough that npq $\geq$ 10.

The (1 - $\alpha$)100% confidence interval for population proportion p is given by 

$\hat{p} - Z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{p}\hat{q}}{n}} \leq p \leq \hat{p} + Z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{p}\hat{q}}{n}}$
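
The interval can also be computed directly from the formula. The following is a minimal sketch: the helper name `proportion_ci` and the data (45 successes in 120 trials) are made up for illustration and are not part of the original notebook. Here n × $\hat{p}$ × $\hat{q}$ = 120 × 0.375 × 0.625 ≈ 28, so the rule of thumb is satisfied.

```python
import numpy as np
import scipy.stats as stats

def proportion_ci(count, nobs, alpha=0.05):
    """Normal-approximation interval for a proportion (hypothetical helper)."""
    p_hat = count / nobs
    q_hat = 1 - p_hat
    z     = stats.norm.isf(alpha / 2)               # Z value for upper-tail area alpha/2
    half  = z * np.sqrt(p_hat * q_hat / nobs)       # half-width of the interval
    return p_hat - half, p_hat + half

# Made-up data: 45 successes in 120 trials
lower, upper = proportion_ci(count=45, nobs=120)
print('95 %s confidence interval for the proportion is %1.4f to %1.4f' % ('%', lower, upper))
```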

### Example 4

**A medical pharmacy was interested in finding the proportion of customers who pay cash for their medicines as against digital cash or plastic money.**

**From a sample of 200 customers, it was found that 140 customers paid by cash.** 

**Calculate the 95% confidence interval for proportions who pay by cash.**

In [None]:
In this example, 
* n          =  200
* $\hat{p}$  =  140 / 200 = 0.7
* $\hat{q}$  =  60  / 200 = 0.3

We find n X $\hat{p}$ X $\hat{q}$ = 200 * 0.7 * 0.3 = 42 $\geq$ 10, and hence we can use the confidence interval equation

$\hat{p} - Z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{p}\hat{q}}{n}} \leq p \leq \hat{p} + Z_{\frac{\alpha}{2}} \sqrt{\frac{\hat{p}\hat{q}}{n}}$

We can use the function statsmodels.stats.proportion.proportion_confint to find the required confidence interval by giving the following parameters:
* count  = Number of events of interest such as number of customers who paid by cash
* nobs   = Total number of observations
* alpha  = Level of significance for determining the confidence interval
* method = Method used for finding the confidence interval

In [16]:
n           = 200
n_success   = 140
alpha       = 0.05
LCI, UCI    = SMP.proportion_confint(count = n_success, nobs = n, alpha = alpha, method = 'normal')
print('95 %s confidence interval for population proportion is %1.4f  to %1.4f' % ('%', LCI , UCI))

95 % confidence interval for population proportion is 0.6365  to 0.7635


## Practice Exercise 3

#### Set the confidence level to be 90% in the above example 4 and calculate the required confidence interval

## Confidence interval for population mean when standard deviation is unknown

When the standard deviation of the population is unknown, we cannot use the formula 
$\overline{X} \pm Z_{\frac{\alpha}{2}} \frac {\sigma} {\sqrt{n}}$

William Sealy Gosset, writing under the pseudonym "Student", showed in 1908 that if the population follows a normal distribution and the standard deviation is estimated from the sample, then the statistic 
t = $\frac{\overline{X} - \mu} {\frac{S} {\sqrt{n}}}$ 
follows a t distribution with n-1 degrees of freedom.

* Here, S is the standard deviation estimated from the sample; $\frac{S}{\sqrt{n}}$ is the estimated standard error of the mean.
* $\mu$ is the population mean.
* n is the sample size

Note: n - 1 is the degrees of freedom.

**Degrees of freedom for an estimate is equal to the number of values minus the number of parameters estimated to arrive at the estimate in question. In other words, the number of values in the final calculation of a statistic that are free to vary.** 

The t distribution is very similar to the standard normal distribution.  As the degrees of freedom increases, the t distribution converges to standard normal distribution.

The (1 - $\alpha$) 100% confidence interval for the population mean when the population standard deviation is unknown is given as:

$\overline{X} \pm t_{\frac{\alpha}{2},\, n-1} \frac {S} {\sqrt{n}}$

* In the above equation, $t_{\frac{\alpha}{2},\, n-1}$ is the value of t under the t distribution with n-1 degrees of freedom for which the upper-tail probability is $\frac{\alpha}{2}$, that is, the cumulative probability F(t) = 1 - $\frac{\alpha}{2}$ (0.975 when $\alpha$ is 0.05). By symmetry, the value of t with F(t) = 0.025 has the same magnitude and the opposite sign.
* Degrees of freedom is (n - 1) since a sample of size n is used to estimate the standard deviation. 

For various values of $\alpha$, calculate the absolute values of $t_{\frac{\alpha}{2},\, n-1}$ along with the $Z_{\frac{\alpha}{2}}$ values, using scipy.stats.t.ppf and scipy.stats.norm.isf.

In [17]:
Significance_Level =  [0.1, 0.05, 0.02, 0.01] 
df_List            =  [10,  30,   50, 100, 500]

# DataFrame.append and reindex_axis were removed from pandas,
# so build each row as a dict with the columns already in display order
rows = []
for alpha in Significance_Level:
    SL_2 = alpha / 2
    row  = {"alpha": alpha}
    for deg_fr in df_List:
        row["t-%d" % deg_fr] = np.abs(round(stats.t.ppf((1 - SL_2), deg_fr), 4))
    row["Z"] = np.abs(round(stats.norm.isf(q = SL_2), 2))
    rows.append(row)
df = pd.DataFrame(rows)
print(df)

   alpha    t-10    t-30    t-50   t-100   t-500     Z
0   0.10  1.8125  1.6973  1.6759  1.6602  1.6479  1.64
1   0.05  2.2281  2.0423  2.0086  1.9840  1.9647  1.96
2   0.02  2.7638  2.4573  2.4033  2.3642  2.3338  2.33
3   0.01  3.1693  2.7500  2.6778  2.6259  2.5857  2.58


| $\alpha$ | t with df = 10 | t with df = 30 | t with df = 50 | t with df = 100 | t with df = 500 | Z with alpha/2 | 
| ---- | ------ | ----- | ---- | ------ | ----- | ---- |
| 0.10 | 1.8125 | 1.6973 | 1.6759 | 1.6602 | 1.6479 | 1.64 |
| 0.05 | 2.2281 | 2.0423 | 2.0086 | 1.9840 | 1.9647 | 1.96 |
| 0.02 | 2.7638 | 2.4573 | 2.4033 | 2.3642 | 2.3338 | 2.33 |
| 0.01 | 3.1693 | 2.7500 | 2.6778 | 2.6259 | 2.5857 | 2.58 |

**We observe that the values of t and Z converge for higher degrees of freedom.**

### Example 5

The following table contains the length of stay in minutes of each customer at a Fast Food restaurant.

|      |      |      |      |      |
| ---  | ---  | ---  | ---  | ---  |
| 7.42 | 6.29 | 5.83 | 6.50 | 8.34 |
| 9.51 | 7.10 | 6.80 | 5.90 | 4.89 |
| 6.50 | 5.52 | 7.90 | 8.30 | 9.60 |

* a. *Construct 95% confidence interval estimate for the population mean length of stay at Fast Food restaurant, assuming a normal distribution.*
* b. *Interpret the interval constructed at a.*

In [18]:
L = [7.42, 6.29, 5.83, 6.50, 8.34, 9.51, 7.10, \
     6.80, 5.90, 4.89, 6.50, 5.52, 7.90, 8.30, 9.60]
lengthStay = np.array(L)
avg        = lengthStay.mean(axis = 0)
sd         = np.std(lengthStay,ddof = 1)
print('Mean of length of stay is %1.4f' % avg)
print('SD of length of stay is %1.6f'   % sd)
# ddof = 1 divides the sum of squared deviations from the mean by n - 1, giving the sample standard deviation

Mean of length of stay is 7.0933
SD of length of stay is 1.406031


In [19]:
n      = 15
SL_2   = 0.025
deg_fr = n - 1
tval = np.abs(round(stats.t.ppf( (1- SL_2), deg_fr),4)) 
print(tval)

2.1448


In [27]:
LCI         = avg - (tval * (sd / np.sqrt(n)))
UCI         = avg + (tval * (sd / np.sqrt(n)))
print('a. 95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', LCI , UCI))

a. 95 % confidence interval for population mean is 6.3147  to 7.8720


### Another method using scipy.stats

In [28]:
ci          = 0.95                  # confidence level (1 - alpha), not alpha itself
S           = sd / np.sqrt(n)       # estimated standard error of the mean
LCI, UCI    = stats.t.interval(ci, deg_fr, avg, S)
print('a. 95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', LCI , UCI))

a. 95 % confidence interval for population mean is 6.3147  to 7.8720


**b. You can be 95% confident that the mean length of stay at the Fast Food restaurant lies between 6.31 minutes and 7.87 minutes.**

## Practice Exercise 4

The time taken (in days) to resolve customer complaints for 100 customers of a service organization is given below:

 |      |      |      |      |      |      |      |      |      |      |
 | ---  | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
 | 2.50 | 3.26 | 2.79 | 3.74 | 5.60 | 3.24 | 3.65 | 3.91 | 4.35 | 3.35 |
 | 5.67 | 5.38 | 3.54 | 5.10 | 3.66 | 3.01 | 3.96 | 4.98 | 4.56 | 5.00 |
 | 5.03 | 5.29 | 4.91 | 4.63 | 2.94 | 3.82 | 4.76 | 2.24 | 4.25 | 3.45 |
 | 3.14 | 4.64 | 4.56 | 4.61 | 2.68 | 3.61 | 5.46 | 2.83 | 4.84 | 4.31 |
 | 2.98 | 3.90 | 4.45 | 3.62 | 6.15 | 4.04 | 5.19 | 4.63 | 2.78 | 2.95 |
 | 3.65 | 4.49 | 3.52 | 4.07 | 4.16 | 5.56 | 2.69 | 6.69 | 1.26 | 3.14 |
 | 4.71 | 4.80 | 3.41 | 3.18 | 4.64 | 4.23 | 4.36 | 3.94 | 3.81 | 4.26 |
 | 2.92 | 2.87 | 2.08 | 3.09 | 3.60 | 2.93 | 3.85 | 4.66 | 4.70 | 3.61 |
 | 5.59 | 3.39 | 3.13 | 4.14 | 4.23 | 4.25 | 4.12 | 5.95 | 4.76 | 4.96 |
 | 2.27 | 3.77 | 5.25 | 3.05 | 3.20 | 5.22 | 3.84 | 2.24 | 4.75 | 3.07 |


* a. *Construct 95% confidence interval estimate for the population mean days to resolve customer complaints,
      assuming a normal distribution.*
* b. *Interpret the interval constructed at a.*

**Hint**

* 1) Use the following code to obtain the NumPy array resolvedDays needed to solve this problem.

In [30]:
resolved_in_days = [2.50, 3.26, 2.79, 3.74, 5.60, 3.24, 3.65, 3.91, 4.35, 3.35,\
5.67, 5.38, 3.54, 5.10, 3.66, 3.01, 3.96, 4.98, 4.56, 5.00,\
5.03, 5.29, 4.91, 4.63, 2.94, 3.82, 4.76, 2.24, 4.25, 3.45,\
3.14, 4.64, 4.56, 4.61, 2.68, 3.61, 5.46, 2.83, 4.84, 4.31,\
2.98, 3.90, 4.45, 3.62, 6.15, 4.04, 5.19, 4.63, 2.78, 2.95,\
3.65, 4.49, 3.52, 4.07, 4.16, 5.56, 2.69, 6.69, 1.26, 3.14,\
4.71, 4.80, 3.41, 3.18, 4.64, 4.23, 4.36, 3.94, 3.81, 4.26,\
2.92, 2.87, 2.08, 3.09, 3.60, 2.93, 3.85, 4.66, 4.70, 3.61,\
5.59, 3.39, 3.13, 4.14, 4.23, 4.25, 4.12, 5.95, 4.76, 4.96,\
2.27, 3.77, 5.25, 3.05, 3.20, 5.22, 3.84, 2.24, 4.75, 3.07]
resolvedDays = np.array(resolved_in_days)

### Confidence interval for population variance

The random variable defined by $\frac{(n-1) {S_i}^2}{\sigma^2}$ follows a $\chi^2$ distribution with n-1 degrees of freedom.

Here ${S_1}^2$, ${S_2}^2$, ..., ${S_k}^2$ are the sample variances estimated from samples of size n drawn from a normal distribution with variance ${\sigma}^2$.

The (1 - $\alpha$)100% confidence interval for the variance ${\sigma}^2$ is given by Tate and Klett (1959) and Cohen (1972), where $\chi^2_{p,\, n-1}$ denotes the value with upper-tail probability p:

$\bigg( \frac{(n-1) S^2}{\chi^2_{\frac{\alpha}{2},\, n-1}} ,\ \frac{(n-1) S^2}{\chi^2_{1 - \frac{\alpha}{2},\, n-1}} \bigg)$
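
The interval above can be wrapped in a small function. This is a sketch under stated assumptions: the helper name `variance_ci` and the inputs (a sample variance of 4.0 from 25 observations) are made up for illustration. It uses scipy.stats.chi2.isf, which returns the value with a given upper-tail probability.

```python
import scipy.stats as stats

def variance_ci(sample_var, n, alpha=0.05):
    """Chi-square interval for a population variance (hypothetical helper)."""
    dof        = n - 1
    chi2_upper = stats.chi2.isf(alpha / 2, dof)        # upper-tail area alpha/2
    chi2_lower = stats.chi2.isf(1 - alpha / 2, dof)    # upper-tail area 1 - alpha/2
    # Dividing by the larger chi-square value gives the lower limit
    return dof * sample_var / chi2_upper, dof * sample_var / chi2_lower

# Made-up data: sample variance 4.0 estimated from 25 observations
lower, upper = variance_ci(sample_var=4.0, n=25)
print('95 %s confidence interval for the variance is %1.4f to %1.4f' % ('%', lower, upper))
```

Note that the interval is not symmetric around the sample variance, because the $\chi^2$ distribution is skewed.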

### Example 6

The variance of the volume of 20 litre water cans is estimated to be 324 ml$^2$ based on a sample of 50 water cans. 

a. Calculate a 95% confidence interval for the variance in water cans.

In [30]:
n       = 50
df      = n-1
alpha   = 0.05
alpha_L = [(1 - (alpha/2)),(alpha/2)]
for i in range(len(alpha_L)):
   a = alpha_L[i]
   print('alpha %1.4f chi-square value %3.2f' % ((1-a),round(stats.chi2.ppf(a, df),2)))

alpha 0.0250 chi-square value 70.22
alpha 0.9750 chi-square value 31.55


95% confidence interval for variance is given by:

$\bigg( \frac{(n-1) S^2}{\chi^2_{\frac{\alpha}{2},\, n-1}} ,\ \frac{(n-1) S^2}{\chi^2_{1 - \frac{\alpha}{2},\, n-1}} \bigg)$

In [33]:
Var = 324
LCI = round((df * Var) / 70.22,2)
UCI = round((df * Var) / 31.55,2)
print('a. 95 %s confidence interval for population variance is %1.4f  to %1.4f' % ('%', LCI , UCI))

a. 95 % confidence interval for population variance is 226.0900  to 503.2000


### Practice Exercise 5

Construct a 99% confidence interval for example 6 given above

## Setting up hypothesis

*Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion* - Stephen M Stigler

* Hypothesis is a claim made by a person / organization.

* The claim is usually about the population parameters such as mean or proportion and we seek evidence from a sample for the support of the claim (Example: average salary of Data Scientist with 1 year experience is Rs 5 Lakhs per annum).

* Hypothesis testing is a process used for either rejecting or retaining null hypothesis.

**Examples of some claims:**
*  If you drink Horlicks, you can grow taller, stronger and sharper.
*  Two minutes for cooking noodles (or eating!!).
*  Married people are happier than singles (Anon - 2015).
*  Smokers are better sales people.

*Hypothesis testing is used for checking the validity of the claim using evidence found in sample data.*

### Type I Error, Type II error and power of the hypothesis test

### Type I error

* The conditional probability of rejecting a null hypothesis when it is true is called **Type I error or False Positive.**
* $\alpha$, the level of significance, is the probability of a Type I error.
* P(Reject null hypothesis | $H_0$ is true) = $\alpha$

### Type II error

* The conditional probability of retaining a null hypothesis when it is false is called **Type II error or False Negative.**
* $\beta$ is the probability of a Type II error.
* P(Retain null hypothesis | $H_0$ is false) = $\beta$

### Power of the test

* (1 - $\beta$) is known as the **power of the test**.
* It is P(Reject null hypothesis | $H_0$ is false) = 1- $\beta$
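
The meaning of $\alpha$ can be illustrated by simulation: when the null hypothesis is true, a test at level $\alpha$ should reject it in roughly $\alpha$ of all repetitions. The sketch below uses a two-tailed Z test with assumed values ($\mu_0$ = 50, $\sigma$ = 5, n = 30 and 10,000 repetitions) that are not part of the original notebook.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=2)

mu0, sigma = 50.0, 5.0       # H0 is true: the data really come from mean mu0
n, trials  = 30, 10_000
alpha      = 0.05
z_crit     = stats.norm.isf(alpha / 2)

# Z statistic for each simulated sample
xbar   = rng.normal(loc=mu0, scale=sigma, size=(trials, n)).mean(axis=1)
z_stat = (xbar - mu0) / (sigma / np.sqrt(n))

# Share of samples in which a true H0 is (wrongly) rejected
type1_rate = np.mean(np.abs(z_stat) > z_crit)
print('Observed Type I error rate: %1.4f (alpha = %1.2f)' % (type1_rate, alpha))
```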

## Steps involved in solving the hypothesis testing

### 1 Define null and alternative hypotheses

* ### Null hypothesis means no relationship or status quo
* ### Alternative hypothesis is what the researcher wants to prove

### Example 7

Write the null and alternative hypotheses from the following hypothesis descriptions:

a. Average annual salary of Data Scientists is different for those having Ph.D in Statistics and those who do not.
* Let $\mu_{PhD}$ be the average annual salary of a Data scientist with Ph.D in Statistics.
* Let $\mu_{NoPhD}$ be the average annual salary of a Data scientist without Ph.D in Statistics.

* Null hypothesis:        $H_0$: $\mu_{PhD}$ =    $\mu_{NoPhD}$ 
* Alternative hypothesis: $H_A$: $\mu_{PhD}$ $\neq$ $\mu_{NoPhD}$ 

Since the rejection region is on either side of the distribution, it will be a **two-tailed** test.

b. Average annual salary of Data Scientists is more for those having Ph.D in Statistics than for those who do not.

* Null hypothesis:        $H_0$: $\mu_{PhD}$ $\leq$   $\mu_{NoPhD}$ 
* Alternative hypothesis: $H_A$: $\mu_{PhD}$ >        $\mu_{NoPhD}$ 

Since the rejection region is on the right side of the distribution, it will be a **one-tailed** (right-tailed) test.

### 2 Decide the significance level

* You control the Type I error by choosing $\alpha$, the level of significance: the risk you are willing to take of rejecting the null hypothesis when it is true. Traditionally, you select a level of 0.01, 0.05 or 0.10. The choice depends on the cost of making a Type I error.

* One way to reduce the probability of making a Type II error is by increasing the sample size. For a given level of $\alpha$, increasing the sample size decreases $\beta$ resulting in increasing the power of the statistical test to detect that null hypothesis is false.
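
The effect of sample size on $\beta$ can be seen numerically. The sketch below assumes a hypothetical lower-tailed Z test ($H_0$: $\mu \geq$ 100 against $H_A$: $\mu$ < 100) with a true mean of 98 and $\sigma$ = 10; these values are illustrative and not taken from the notebook. For each n it finds the critical value of the sample mean under $H_0$ and then the probability of staying in the non-rejection region when the true mean is 98.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical lower-tailed test: H0: mu >= 100 against HA: mu < 100
mu0, mu1 = 100.0, 98.0       # hypothesised mean and assumed true mean
sigma    = 10.0
alpha    = 0.05

betas = []
for n in [10, 30, 100, 300]:
    serr = sigma / np.sqrt(n)
    crit = stats.norm.ppf(alpha, loc=mu0, scale=serr)   # reject H0 when the sample mean is below crit
    beta = stats.norm.sf(crit, loc=mu1, scale=serr)     # P(sample mean > crit | mu = mu1)
    betas.append(beta)
    print('n = %3d  beta = %1.4f  power = %1.4f' % (n, beta, 1 - beta))
```

With $\alpha$ held fixed, each increase in n shrinks the standard error, so $\beta$ falls and the power rises.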

### 3 Identify the test statistic

* ### The test statistic will depend on the probability distribution of the sampling distribution

### 4 Calculate the p-value or critical values

* ### P-value is the conditional probability of observing the test statistic value, or a more extreme value, when the null hypothesis is true.

* ### Critical value approach

* Critical values for the appropriate test statistic are selected so that the rejection region contains a total area of $\alpha$ when $H_0$ is true and the non-rejection region contains a total area of 1 - $\alpha$ when $H_0$ is true.

### 5 Decide to reject or accept null hypothesis

* ### Reject null hypothesis when test statistic lies in the rejection region; retain null hypothesis otherwise. 
* ### OR
* ### Reject null hypothesis when p-value < α; retain null hypothesis otherwise.
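
The steps listed above can be sketched as a single function for the two-tailed Z test with known $\sigma$. The helper name `z_test_two_tailed` and the numbers used to call it are hypothetical, chosen only for illustration; the function returns both the critical value and the p-value so that the two decision rules can be compared.

```python
import numpy as np
import scipy.stats as stats

def z_test_two_tailed(xbar, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample Z test with known sigma (hypothetical helper)."""
    z_stat  = (xbar - mu0) / (sigma / np.sqrt(n))   # step 3: test statistic
    z_crit  = stats.norm.isf(alpha / 2)             # step 4: critical value
    p_value = 2 * stats.norm.sf(abs(z_stat))        # step 4: two-tailed p-value
    reject  = bool(p_value < alpha)                 # step 5: decision
    return z_stat, z_crit, p_value, reject

# Made-up numbers: H0: mu = 50 against HA: mu != 50
z_stat, z_crit, p_value, reject = z_test_two_tailed(xbar=52.3, mu0=50, sigma=8, n=64)
print('Z = %1.4f, critical value = %1.4f, p-value = %1.4f, reject H0: %s'
      % (z_stat, z_crit, p_value, reject))
```

Both rules lead to the same decision: the p-value is below $\alpha$ exactly when the absolute value of the test statistic exceeds the critical value.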


### Example 8

A beverages company produces mineral water, available in 250 ml, 500 ml, 1 litre and 2 litre bottles and in 5 litre, 15 litre and 20 litre jars.
Let us consider 2 litre bottles. Company specifications require a mean volume of 2 litres per bottle.
You must adjust the water filling process when the mean volume in the population of bottles differs from 2 litres. Adjusting the process requires shutting down the water filling production line completely, so you do not want to make any adjustments unnecessarily.

Assume a sample of 50 water bottles indicates a sample mean, $\overline{X}$ of 2.001 litres and the population standard deviation, $\sigma$ is 15 ml.

#### Hypothesis testing using the critical value approach

### Step 1: Define null and alternative hypotheses

In testing whether the mean volume is 2 litres, the null hypothesis states that the mean volume, $\mu$ equals 2 litres. The alternative hypothesis states that the mean volume, $\mu$ is not equal to 2 litres.
* $H_0$: $\mu$ = 2
* $H_A$: $\mu$ $\neq$ 2



### Step 2: Decide the significance level

Choose the $\alpha$, the level of significance according to the relative importance of the risks of committing Type I and Type II errors in the problem. 

In this example, making a Type I error means that you conclude that the population mean is not 2 litres when it is 2 litres. This implies that you will take corrective action on the filling process even though the process is working well (*false alarm*).

On the other hand, when the population mean is 1.98 litres and you conclude that the population mean is 2 litres, you commit a Type II error. Here, you allow the process to continue without adjustment, even though an adjustment is needed (*missed opportunity*).

Here, we select $\alpha$ = 0.05 and n, sample size = 50.

### Step 3:  Identify the test statistic

We know the population standard deviation and the sample is a large sample, n>30. So you use the normal distribution and the $Z_{STAT}$ test statistic.

### Step 4: Calculate the critical value

We know the $\alpha$ is 0.05. So, the critical values of the $Z_{STAT}$ test statistic are -1.96 and 1.96.

In [3]:
print(np.abs(round(stats.norm.isf(q = 0.025),2))) # Here we use alpha by 2  for two-tailed test

1.96


* ### Rejection region is $Z_{STAT}$ < -1.96 or $Z_{STAT}$ > 1.96.
* ### Acceptance or non-rejection region is -1.96 $\leq$ $Z_{STAT}$ $\leq$ 1.96

We collect the sample data and calculate the test statistic. 
In our example, 
* $\overline{X}$ = 2.001 litres
* $\mu$   = 2 litres
* $\sigma$ = 15 ml = 0.015 litres (expressed in the same unit as $\overline{X}$ and $\mu$)
* n       = 50
* $Z_{STAT} = \frac{\overline{X} - \mu} {\frac{\sigma}{\sqrt{n}}}$ 

In [5]:
XAvg  = 2.001
mu    = 2
sigma = 0.015   # 15 ml expressed in litres, to match the units of XAvg and mu
n     = 50
Z = (XAvg - mu)/(sigma/np.sqrt(n))
print('Value of Z is %2.5f' %Z)

Value of Z is 0.47140


### 5 Decide to reject or accept null hypothesis

In this example, Z = 0.47140 lies in the acceptance region because 
-1.96 < Z = 0.47140 < 1.96.

So the statistical decision is not to reject the null hypothesis.

### So there is insufficient evidence to conclude that the mean fill is different from 2 litres.

### Practice Example 6

In a bank, the average time taken for getting a demand draft or bankers cheque is 15 minutes.
From the past experience, you can assume that the population is normally distributed with a population standard deviation of 1.6 minutes. 

You select a sample of 50 requests for demand drafts and the sample mean is 14 minutes.

#### Use the five step approach listed above to determine whether there is evidence at a 5% level of significance that the population mean service time to get the demand draft has changed from the population mean of 15 minutes. 

## Hypothesis testing using p-value approach

Use Example 8, with $\sigma$ = 15 ml expressed as 0.015 litres so that $Z_{STAT}$ = (2.001 - 2) / (0.015 / $\sqrt{50}$) = 0.4714.
To use the p-value approach, find the probability that the test statistic $Z_{STAT}$ is equal to or more extreme than 0.4714 standard error units from the centre of the standard normal distribution.

### Step 4:  Calculate the p-value 

* We need to compute P(Z < -0.4714) and P(Z > 0.4714).
* P value for the two-tail test is  P(Z < -0.4714) + P(Z > 0.4714).

In [11]:
Z = -0.4714
P1 = stats.norm.cdf(Z)
print('%1.3f' % P1)
Z1 = 0.4714
P2 = 1- stats.norm.cdf(Z1)
print('%1.3f' % P2)
P  = P1 + P2
print('%1.3f' % P)

0.319
0.319
0.637


### Step 5: Decide to reject or accept null hypothesis

Since the P value for this two tail test is 0.637 and it is greater than 0.05, our level of significance, we do not reject the null hypothesis and conclude that there is insufficient evidence to conclude that the mean fill is different from 2 litres.

### Example 9:

A manufacturer claims that the mean lifetime of LED lamp is more than 50000 hours. Assume actual mean LED lamp lifetime is 49950 hours and population standard deviation is 120 hours. 

At 5% level of significance, what is the probability of having type II error for a sample size of 30 LED lamps?

In [62]:
n         = 30    # sample size
sigma     = 120  # population standard deviation
serr      = sigma / np.sqrt(n) # Standard Error

alpha     = 0.05     # significance level
mu0       = 50000 #  hypothetical lower bound
q         = int(round(stats.norm.isf(q = 1- alpha, loc = mu0, scale = serr),0))
print(q)

49964


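* Assume actual mean LED lamp lifetime is 49950 hours 
* A Type II error occurs when the sample mean falls in the non-rejection region (above the critical value of 49964 hours) even though the true mean is 49950 hours, so we need to find P($\overline{X}$ > 49964 | $\mu$ = 49950)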

In [63]:
mu1  = 49950 # Actual mean

p = round(1 - stats.norm.cdf(q, loc = mu1, scale = serr),4)
print('At 5 %s level of significance, the probability of having type II error\n\
       for a sample size of 30 LED lamps is %2.4f' %('%',p))
print('At 5 %s level of significance, the POWER OF THE TEST\n \
      for a sample size of 30 LED lamps is %2.4f' %('%',1 - p))


At 5 % level of significance, the probability of having type II error
       for a sample size of 30 LED lamps is 0.2614
At 5 % level of significance, the POWER OF THE TEST
       for a sample size of 30 LED lamps is 0.7386


### Practice example 7

Change the 5% level of significance to 1% level of significance in the above example 9 and find out the probability of type II error.

### Take home exercises

### 1) Hours spent on studies by 20 students on a course is given in the following table:

|     |     |     |     |     |
| --- | --- | --- | --- | --- |   
| 4.7 | 9.2 | 9.3 | 11.2 | 8 |
| 7.6 | 7.4 | 4.9 | 9.2 | 5.3| 
| 1.7 | 2.8 | 7.2 | 12.3 | 8.6 |
| 10.6 | 9 | 5.7 | 6.9 | 3.8 |

Assume that the population of hours spent follows a normal distribution and the standard deviation is 3.1 hours, calculate the 90% confidence interval for the mean hours spent by the students.

Hint:

In [11]:
import numpy as np
H           = [ 4.7, 9.2, 9.3, 11.2, 8, 7.6, 7.4, 4.9, 9.2, 5.3, 1.7, 2.8, 7.2, 12.3, 8.6, 10.6, 9, 5.7, 6.9, 3.8]
HoursSpent  = np.array(H)

### 2) Assume the food label on a Lays chips bag states that there is at most 182 mg of sodium in a single chip. Assume the actual mean amount of sodium per chip is 182.09 mg and the population standard deviation is 0.20 mg. 
At 5% significance level, what is the probability of type II error for a sample of 40 chips?

## END