## Confidence Intervals    

* When there is uncertainty in measuring the value of an important population parameter, it is better to find the range in which the value of the parameter is likely to lie rather than predicting a point estimate (single value).
* Confidence interval is the range in which the value of a population parameter is likely to lie with certain probability.
* Confidence interval provides additional information about the population parameter that will be useful in decision making.

### Confidence interval for population mean

Let $X_1$, $X_2$, $X_3$, ..., $X_n$ be the sample means of samples $S_1$, $S_2$, $S_3$, ..., $S_n$ whose observations are independent and identically distributed draws from a population with mean $\mu$ and standard deviation $\sigma$.

From the Central Limit Theorem, we know that the sample means, $X_i$ follows a normal distribution with mean, $\mu$ and standard deviation $\frac{\sigma} {\sqrt{n}}$.

The variable Z = $\frac{X_i - \mu}{\frac{\sigma} {\sqrt{n}}}$ follows a standard normal distribution.
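The Central Limit Theorem statement above can be checked with a small simulation. This is a minimal sketch, assuming an exponential population (chosen here only because it is clearly non-normal); the seed, sample size and number of samples are illustrative values, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the run is repeatable

mu, sigma = 2.0, 2.0      # an exponential population with scale 2 has mean 2 and sd 2
n = 50                    # size of each sample (assumed)
num_samples = 20_000      # number of samples drawn (assumed)

# Each row is one sample; the row means play the role of the sample means X_i.
samples = rng.exponential(scale=mu, size=(num_samples, n))
sample_means = samples.mean(axis=1)

print("mean of sample means:", sample_means.mean())
print("sd of sample means  :", sample_means.std(ddof=1))
print("sigma / sqrt(n)     :", sigma / np.sqrt(n))
```

The first printed value should be close to $\mu$ and the second close to $\frac{\sigma}{\sqrt{n}}$, even though the population itself is skewed.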

### Assume that we want to find (1 - $\alpha$) 100% confidence interval for the population mean. 
* We can distribute $\alpha$ (probability of not observing true population parameter mean in the interval) equally ($\alpha/2$) on either side of the distribution shown.

* For $\alpha$ = 0.05 or $\alpha/2$ = 0.025, that is 95% confidence interval, we can calculate lower and upper values of the confidence interval from the standard normal distribution.
* scipy.stats.norm.isf(q = 0.025) gives the value of Z for which the area under the standard normal curve to its right is 0.025.
* The corresponding value is approximately 1.96 as shown in the previous example.
* Using the transformation relationship between standard normal random variable Z and normal random variable X, we can write the 95% confidence interval for population mean when population standard deviation ($\sigma$) is known as:
$\overline{X} \pm 1.96 \frac {\sigma} {\sqrt{n}}$, where $\overline{X}$ is the estimated value of mean from a sample of size n.
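As a quick check of the critical value used above, the sketch below compares `scipy.stats.norm.isf` (inverse survival function, upper tail) with `scipy.stats.norm.ppf` (inverse cumulative distribution, lower tail). The variable names are illustrative.

```python
from scipy import stats

alpha = 0.05
z_upper = stats.norm.isf(alpha / 2)       # value with area alpha/2 to its right
z_same  = stats.norm.ppf(1 - alpha / 2)   # the same value, found from the cumulative side
z_lower = stats.norm.ppf(alpha / 2)       # value with area alpha/2 to its left

print(round(z_upper, 2), round(z_same, 2), round(z_lower, 2))   # 1.96 1.96 -1.96
```

Because the standard normal distribution is symmetric, the lower and upper critical values differ only in sign, which is why the interval is written with $\pm$.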

#### In general, the (1 - $\alpha$) 100% confidence interval for the population mean when population standard deviation is known can be written as 

$\overline{X} \pm Z _\frac{\alpha}{2} \frac {\sigma} {\sqrt{n}}$

This equation is valid for large sample sizes, irrespective of the distribution of the population.

This is equivalent to

P($\overline{X} - Z_\frac{\alpha}{2} \times \frac{\sigma}{\sqrt{n}} \leq \mu \leq \overline{X} + Z_\frac{\alpha}{2} \times \frac{\sigma}{\sqrt{n}}$) = 1 - $\alpha$
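The probability statement above can be read as a long-run coverage rate: if the sampling is repeated many times, about (1 - $\alpha$) of the intervals should contain $\mu$. The following is a rough simulation of that idea, assuming a normal population with made-up parameters.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

mu, sigma = 10.0, 3.0       # hypothetical population parameters
n, alpha = 40, 0.05         # sample size and significance level (assumed)
num_trials = 10_000

z = stats.norm.isf(alpha / 2)
half_width = z * sigma / np.sqrt(n)

# Draw many samples and check how often the interval around each sample mean contains mu.
sample_means = rng.normal(mu, sigma, size=(num_trials, n)).mean(axis=1)
covered = (sample_means - half_width <= mu) & (mu <= sample_means + half_width)

coverage = covered.mean()
print("fraction of intervals containing mu:", coverage)
```

The printed fraction should be close to 0.95. Any single interval either contains $\mu$ or it does not; the 95% describes the procedure over repeated samples.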

The absolute values of $Z_\frac{\alpha}{2}$ are shown in the following table:

In [0]:
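import numpy as np
import pandas as pd
from scipy import stats

Significance_Level =  [0.1, 0.05, 0.02, 0.01] 
rows               =  []

for i in range(len(Significance_Level)):
    SL_2    =  Significance_Level[i] / 2
    # isf returns the Z value with an upper-tail area of alpha / 2
    Z       =  np.abs(round(stats.norm.isf(q = SL_2),2))
    data    =  pd.DataFrame({"Z_alpha_by_2" : Z, "alpha": Significance_Level[i]}, index = ['alpha'] )
    rows.append(data)
df = pd.concat(rows)   # DataFrame.append was removed in pandas 2.0
print(df)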

       Z_alpha_by_2  alpha
alpha          1.64   0.10
alpha          1.96   0.05
alpha          2.33   0.02
alpha          2.58   0.01


| $\quad \alpha \quad$  | $Z_\frac{\alpha}{2}$ | Confidence Interval for $\mu$ when $\sigma$ is known |
| ----         | ----             | -------------------------------------------------------- |
| 0.1          | 1.64             |  $\overline{X} \quad \pm 1.64 \frac {\sigma} {\sqrt{n}}$  |
| 0.05         | 1.96             |  $\overline{X} \quad \pm 1.96 \frac {\sigma} {\sqrt{n}}$  |      
| 0.02         | 2.33             |  $\overline{X} \quad \pm 2.33 \frac {\sigma} {\sqrt{n}}$  |
| 0.01         | 2.58             |  $\overline{X} \quad \pm 2.58 \frac {\sigma} {\sqrt{n}}$  |
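Any row of the table can be reproduced with a small helper. This is only a sketch: `z_interval` is a hypothetical function name, and the sample mean, $\sigma$ and n in the loop are made-up inputs.

```python
import math
from scipy import stats

def z_interval(sample_mean, sigma, n, alpha=0.05):
    """Return (lower, upper) of the (1 - alpha) interval for the mean when sigma is known."""
    z = stats.norm.isf(alpha / 2)             # Z_{alpha/2}
    half_width = z * sigma / math.sqrt(n)     # Z_{alpha/2} * sigma / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical inputs chosen only for illustration.
for alpha in (0.1, 0.05, 0.02, 0.01):
    low, up = z_interval(sample_mean=50.0, sigma=5.0, n=25, alpha=alpha)
    print(alpha, round(low, 3), round(up, 3))
```

The interval widens as $\alpha$ decreases, because a higher confidence level needs a larger $Z_\frac{\alpha}{2}$.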

### EXERCISE

A sample of 100 diabetic patients was chosen to estimate the length of stay at a local hospital. 
The sample mean was 4.5 days and the population standard deviation was known to be 1.2 days.

* a) Calculate the 95% confidence interval for the population mean.
* b) What is the probability that the population mean is greater than 4.73 days?

We need to do the following:
* a. Calculate the Z value corresponding to 4.73 by subtracting the sample mean (4.5) and dividing by the standard error, $\frac{\sigma}{\sqrt{n}}$ = 1.2 / $\sqrt{100}$ = 0.12.
* b. Find the probability corresponding to the Z value using scipy.stats.norm.cdf and subtract it from 1, since cdf gives the cumulative probability up to the Z value and we are interested in the probability of exceeding it.

In [1]:
#  the 95% confidence interval for the population mean.
print(4.5 - ((1.96 * 1.2)/(100)**(1/2))," < ci < ",4.5 + ((1.96 * 1.2)/(100)**(1/2)))

4.2648  < ci <  4.7352


In [5]:
#  the probability that the population mean is greater than 4.73 days
import scipy.stats as st
se = 1.2 / (100)**(1/2)     # standard error of the sample mean
z  = (4.73 - 4.5) / se
print("probability", round((1 - st.norm.cdf(z)) * 100, 2), "%")

probability 2.76 %


### EXERCISE

Hindustan Pencils Pvt. Ltd. is an Indian manufacturer of pencils, writing materials and other stationery items, established in 1958 in Mumbai. Nataraj brand of pencils manufactured by the company is expected to have a mean length of 172 mm and the standard deviation of the length is 0.02 mm.

To ensure quality, a sample is selected at periodic intervals to determine whether the length is still 172 mm and other dimensions of the pencil meet the quality standards set by the company.

You select a random sample of 100 pencils and the mean is 170 mm. 

Construct a 95% confidence interval for the mean pencil length.

In [6]:
print(170 - ((1.96 * 0.02)/(100)**(1/2))," < ci < ",170 + ((1.96 * 0.02)/(100)**(1/2)))

169.99608  < ci <  170.00392


### Confidence interval for population proportion

The Central Limit Theorem for the population proportion is stated below:

If $X_1$, $X_2$, ..., $X_n$ are from Bernoulli trials with probability of success p, that is E($X_i$) = p and Var($X_i$) = pq (where q = 1 - p), 
* then the sampling distribution of probability of success (say $\hat{p}$) for a large sample size follows an approximate normal distribution with mean p, and standard error $\sqrt{\frac{pq}{n}}$, where n is the sample size.


The variable, $\frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}}$ converges to a standard normal distribution.

Note:

* The standard deviation of the sampling distribution of proportions depends on the value of p which is unknown.
* Rule of thumb: the sample size n should be large enough that npq $\geq$ 10.

The (1 - $\alpha$)100% confidence interval for population proportion p is given by 

$\hat{p} - Z_\frac{\alpha}{2} \sqrt{\frac{\hat{p}\hat{q}}{n}} \leq p \leq \hat{p} + Z_\frac{\alpha}{2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$
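The interval for a proportion can be computed directly from the formula. The sketch below is one way to do it; `proportion_interval` is a hypothetical helper and the counts are invented for illustration.

```python
import math
from scipy import stats

def proportion_interval(count, nobs, alpha=0.05):
    """Normal-approximation (1 - alpha) interval for a population proportion."""
    p_hat = count / nobs
    q_hat = 1 - p_hat
    z = stats.norm.isf(alpha / 2)                       # Z_{alpha/2}
    half_width = z * math.sqrt(p_hat * q_hat / nobs)    # Z * sqrt(p_hat * q_hat / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical data: 45 successes in 150 trials, so n * p_hat * q_hat = 31.5 >= 10.
low, up = proportion_interval(count=45, nobs=150)
print(round(low, 4), round(up, 4))
```

The normal approximation is only reasonable when the rule of thumb on npq holds; for small samples or proportions near 0 or 1 other interval methods are usually preferred.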

### EXERCISE

**A medical pharmacy was interested in finding the proportion of customers who pay cash for their medicines as against digital cash or plastic money.**

**From a sample of 200 customers, it was found that 140 customers paid by cash.** 

**Calculate the 95% confidence interval for proportions who pay by cash.**

In this example, 
* n          =  200
* $\hat{p}$  =  140 / 200 = 0.7
* $\hat{q}$  =  60  / 200 = 0.3

We find n $\times$ $\hat{p}$ $\times$ $\hat{q}$ = 200 $\times$ 0.7 $\times$ 0.3 = 42 $\geq$ 10, and hence we can use the confidence interval equation
$\hat{p} - Z_\frac{\alpha}{2} \sqrt{\frac{\hat{p}\hat{q}}{n}} \leq p \leq \hat{p} + Z_\frac{\alpha}{2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$

We can use the function statsmodels.stats.proportion.proportion_confint to find out the required confidence interval by giving the following parameters:
* count  = Number of events of interest such as number of customers who paid by cash
* nobs   = Total number of observations
* alpha  = Level of significance for determining the confidence interval
* method = Method used for finding the confidence interval

In [12]:
import statsmodels.stats.proportion as ss
low,up=ss.proportion_confint(140,200,alpha=0.05,method='normal')
print(low," < ci < ",up)

0.6364899081898882  < ci <  0.7635100918101118


## Confidence interval for population mean when standard deviation is unknown

When the standard deviation of the population is unknown, then we will not be able to use the formula 
$\overline{X} \pm Z _\frac{\alpha}{2} \frac {\sigma} {\sqrt{n}}$

William Sealy Gosset, writing under the pseudonym Student, proved in 1908 that if the population follows a normal distribution and the standard deviation is estimated from the sample, then the following statistic 
t = $\frac{\overline{X} - \mu} {\frac{S} {\sqrt{n}}}$ 
follows a t distribution with n-1 degrees of freedom.

* Here, S is the standard deviation estimated from the sample; $\frac{S}{\sqrt{n}}$ is the standard error of the mean.
* $\mu$ is the population mean.
* n is the sample size

Note: n - 1 is the degrees of freedom.

**Degrees of freedom for an estimate is equal to the number of values minus the number of parameters estimated to arrive at the estimate in question. In other words, the number of values in the final calculation of a statistic that are free to vary.** 
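The idea of n - 1 degrees of freedom can be seen numerically. This is a small sketch with a made-up sample: once the mean is fixed, the deviations must sum to zero, so only n - 1 of them are free to vary.

```python
import numpy as np

x = np.array([4.0, 7.0, 6.0, 3.0, 5.0])   # hypothetical sample
n = len(x)

dev = x - x.mean()
# The deviations from the sample mean always sum to zero.
print("sum of deviations:", dev.sum())

s_n   = np.sqrt((dev ** 2).sum() / n)         # divides by n
s_nm1 = np.sqrt((dev ** 2).sum() / (n - 1))   # divides by the n - 1 degrees of freedom

print(s_n, np.std(x))             # np.std uses ddof=0 by default
print(s_nm1, np.std(x, ddof=1))   # ddof=1 gives the sample standard deviation S
```

When computing S for a t-based interval with NumPy, `ddof=1` needs to be passed explicitly, since the default divides by n.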

The t distribution is very similar to the standard normal distribution.  As the degrees of freedom increases, the t distribution converges to standard normal distribution.

The (1 - $\alpha$) 100% confidence interval for population mean when the population standard deviation is unknown is given as:

$\overline{X} \pm t_{\frac{\alpha}{2},\, n-1} \frac {S} {\sqrt{n}}$

* In the above equation, $t_{\frac{\alpha}{2},\, n-1}$ is the value of t under the t-distribution with n - 1 degrees of freedom for which the upper-tail probability is $\frac{\alpha}{2}$. For $\alpha$ = 0.05, the cumulative probability is F(t) = 0.975 (equivalently, F(-t) = 0.025).
* Degrees of freedom is (n - 1) since a sample of size n is used to estimate the standard deviation. 
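The t-based interval can be sketched as a short function. `t_interval` is a hypothetical helper and the data are invented; the only library calls used are `numpy` array methods and `scipy.stats.t.ppf`.

```python
import numpy as np
from scipy import stats

def t_interval(data, alpha=0.05):
    """(1 - alpha) interval for the mean when the population sd is unknown."""
    x = np.asarray(data, dtype=float)
    n = x.size
    mean = x.mean()
    s = x.std(ddof=1)                              # sample standard deviation S
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t value with upper-tail area alpha/2
    half_width = t_crit * s / np.sqrt(n)
    return mean - half_width, mean + half_width

sample = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 12.8, 11.6]   # hypothetical measurements
low, up = t_interval(sample)
print(low, up)
```

For small n the t critical value is noticeably larger than the corresponding Z value, so this interval is wider than the one obtained by plugging S into the Z formula.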

For various values of $\alpha$, calculate the absolute values of $t_{\frac{\alpha}{2},\, n-1}$ along with the $Z_\frac{\alpha}{2}$ values, using scipy.stats.t.ppf.

In [0]:
import numpy as np
import pandas as pd
from scipy import stats

Significance_Level =  [0.1, 0.05, 0.02, 0.01] 
df_List            =  [10,  30,   50, 100, 500]

rows = []
for alpha in Significance_Level:
    SL_2    =  alpha / 2
    # Z critical value: upper-tail area of alpha / 2 under the standard normal
    Z       =  np.abs(round(stats.norm.isf(q = SL_2), 2))
    t_dict1 =  {"a-0": alpha}
    for deg_fr in df_List:
        # t critical value with the same tail area and deg_fr degrees of freedom
        t_dict1["t-" + str(deg_fr)] = np.abs(round(stats.t.ppf((1 - SL_2), deg_fr), 4))
    t_dict1["z-9725"] = Z
    rows.append(pd.DataFrame(t_dict1, index = ['alpha']))
df = pd.concat(rows)   # DataFrame.append and reindex_axis were removed from pandas
print(df)

        a-0    t-10    t-30    t-50   t-100   t-500  z-9725
alpha  0.10  1.8125  1.6973  1.6759  1.6602  1.6479    1.64
alpha  0.05  2.2281  2.0423  2.0086  1.9840  1.9647    1.96
alpha  0.02  2.7638  2.4573  2.4033  2.3642  2.3338    2.33
alpha  0.01  3.1693  2.7500  2.6778  2.6259  2.5857    2.58


| $\alpha$ | t with df = 10 | t with df = 30 | t with df = 50 | t with df = 100 | t with df = 500 | Z with alpha/2 | 
| ---- | ------ | ----- | ---- | ------ | ----- | ---- |
| 0.10 | 1.8125 | 1.6973 | 1.6759 | 1.6602 | 1.6479 | 1.64 |
| 0.05 | 2.2281 | 2.0423 | 2.0086 | 1.9840 | 1.9647 | 1.96 |
| 0.02 | 2.7638 | 2.4573 | 2.4033 | 2.3642 | 2.3338 | 2.33 |
| 0.01 | 3.1693 | 2.7500 | 2.6778 | 2.6259 | 2.5857 | 2.58 |

**We observe that the values of t and Z converge for higher degrees of freedom.**

### EXERCISE

The following table contains the length of stay in minutes of each customer at a Fast Food restaurant.

|      |      |      |      |      |
| ---  | ---  | ---  | ---  | ---  |
| 7.42 | 6.29 | 5.83 | 6.50 | 8.34 |
| 9.51 | 7.10 | 6.80 | 5.90 | 4.89 |
| 6.50 | 5.52 | 7.90 | 8.30 | 9.60 |

* a. *Construct 95% confidence interval estimate for the population mean length of stay at Fast Food restaurant, assuming a normal distribution.*
* b. *Interpret the interval constructed at a.*

In [15]:
import scipy.stats as st
l = [7.42,6.29,5.83,6.50,8.34,9.51,7.10,6.80,5.90,4.89,6.50,5.52,7.90,8.30,9.60]
n = len(l)
m = sum(l) / n                      # sample mean
p = 0
for i in l:
    p = p + (i - m)**2
s = (p / (n - 1))**(1/2)            # sample standard deviation with n - 1 degrees of freedom
t = st.t.ppf(1 - 0.05/2, n - 1)     # t critical value, since sigma is unknown
print(round(m - t * s / n**(1/2), 2)," < ci < ", round(m + t * s / n**(1/2), 2))

6.31  < ci <  7.87


In [None]:
# We are 95% confident that the population mean length of stay at the fast food restaurant lies between
# about 6.31 and 7.87 minutes: if the sampling were repeated many times, about 95% of the intervals
# constructed this way would contain the true mean.

### Confidence interval for population variance

The random variable defined by $\frac{(n-1) {S_i}^2}{\sigma^2}$ follows a $\chi^2$ distribution with n-1 degrees of freedom.

Here ${S_1}^2$, ${S_2}^2$,... ${S_n}^2$ are the sample variances estimated from samples of size n drawn from a normal distribution with variance ${\sigma }^2$.

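The (1 - $\alpha$)100% confidence interval for the variance ${\sigma}^2$ is given by Tate and Klett (1959) and Cohen (1972):

$\bigg( \frac{(n-1) {S}^2}{{\chi}^2_{\frac{\alpha}{2},\, n-1}}$ , $\frac{(n-1) {S}^2}{{\chi}^2_{1 - \frac{\alpha}{2},\, n-1}} \bigg)$

Here ${\chi}^2_{p,\, n-1}$ denotes the value with area p to its right under the $\chi^2$ distribution with n-1 degrees of freedom.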
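The variance interval can be sketched with `scipy.stats.chi2.isf`, which returns the value with a given area to its right. `variance_interval` is a hypothetical helper and the inputs are made up.

```python
from scipy import stats

def variance_interval(sample_variance, n, alpha=0.05):
    """(1 - alpha) interval for the population variance of a normal population."""
    df = n - 1
    chi2_upper = stats.chi2.isf(alpha / 2, df)       # area alpha/2 to its right
    chi2_lower = stats.chi2.isf(1 - alpha / 2, df)   # area 1 - alpha/2 to its right
    # The larger critical value gives the lower limit and vice versa.
    return df * sample_variance / chi2_upper, df * sample_variance / chi2_lower

low, up = variance_interval(sample_variance=4.0, n=25)   # hypothetical inputs
print(round(low, 3), round(up, 3))
```

Unlike the intervals for the mean, this interval is not symmetric around $S^2$, because the $\chi^2$ distribution is skewed.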

### EXERCISE

The variance of volume of 20 litre water can is estimated to be 324 ml$^2$ based on a sample of 50 water cans. 

a. Calculate a 95% confidence interval for the variance in water cans.

In [14]:
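import scipy.stats as st   # chi2.isf(p, df) gives the value with area p to its right
print(f"{(50-1) * 324 / st.chi2.isf(0.025, 50-1):.0f} , {(50-1) * 324 / st.chi2.isf(0.975, 50-1):.0f}")

226 , 503


95% confidence interval for variance is given by:
$\bigg( \frac{(n-1) {S}^2}{{\chi}^2_{\frac{\alpha}{2},\, n-1}}$ , $\frac{(n-1) {S}^2}{{\chi}^2_{1 - \frac{\alpha}{2},\, n-1}} \bigg)$ with $\alpha$ = 0.05 and n - 1 = 49 degrees of freedom.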

## Setting up hypothesis

*Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion* - Stephen M Stigler

* Hypothesis is a claim made by a person / organization.

* The claim is usually about the population parameters such as mean or proportion and we seek evidence from a sample for the support of the claim (Example: average salary of Data Scientist with 1 year experience is Rs 5 Lakhs per annum).

* Hypothesis testing is a process used for either rejecting or retaining the null hypothesis.

**Examples of some claims:**
*  If you drink Horlicks, you can grow taller, stronger and sharper.
*  Noodles take two minutes to cook. (or to eat !!)
*  Married people are happier than singles (Anon - 2015).
*  Smokers are better sales people.

*Hypothesis testing is used for checking the validity of the claim using evidence found in sample data.*

### Type I Error, Type II error and power of the hypothesis test

### Type I error

* The conditional probability of rejecting a null hypothesis when it is true is called **Type I error or False positive.**
* $\alpha$, the level of significance is the value of Type I error.
* P(Reject null hypothesis | $H_0$ is true) = $\alpha$

### Type II error

* The conditional probability of retaining a null hypothesis when it is false is called **Type II error or False Negative.**
* $\beta$, is the value of Type II error.
* P(Retain null hypothesis | $H_0$ is false) = $\beta$

### Power of the test

* (1 - $\beta$) is known as the **power of the test**.
* It is P(Reject null hypothesis | $H_0$ is false) = 1- $\beta$
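The three quantities can be estimated by simulation. The sketch below assumes a two-tailed Z test with known $\sigma$; all numbers (null mean, actual mean, $\sigma$, n) are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

mu0, sigma, n, alpha = 100.0, 15.0, 36, 0.05   # hypothetical test setup
z_crit = stats.norm.isf(alpha / 2)
num_trials = 20_000

def rejection_rate(true_mean):
    """Fraction of simulated samples whose two-tailed Z test rejects H0: mu = mu0."""
    means = rng.normal(true_mean, sigma, size=(num_trials, n)).mean(axis=1)
    z = (means - mu0) / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) > z_crit)

type_1_rate = rejection_rate(true_mean=mu0)     # H0 true: every rejection is a Type I error
power       = rejection_rate(true_mean=106.0)   # H0 false: every rejection is correct
beta        = 1 - power                         # H0 false: every retention is a Type II error

print("Type I error rate:", type_1_rate)
print("power:", power, " beta:", beta)
```

The estimated Type I error rate should be close to $\alpha$, while the power depends on how far the actual mean is from the value stated in $H_0$.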

## Steps involved in solving the hypothesis testing

### 1 Define null and alternative hypotheses

* ### Null hypothesis means no relationship or status quo
* ### Alternative hypothesis is what the researcher wants to prove

### EXERCISE

Write the null and alternative hypotheses from the following hypothesis descriptions:
a. Average annual salary of Data Scientists is different for those having Ph.D in Statistics and those who do not.
* Let $\mu_{PhD}$ be the average annual salary of a Data scientist with Ph.D in Statistics.
* Let $\mu_{NoPhD}$ be the average annual salary of a Data scientist without Ph.D in Statistics.

* Null hypothesis:        $H_0$: $\mu_{PhD}$ =    $\mu_{NoPhD}$ 
* Alternative hypothesis: $H_A$: $\mu_{PhD}$ $\neq$ $\mu_{NoPhD}$ 

Since the rejection region is on either side of the distribution, it will be a **two-tailed** test.

b. Average annual salary of Data Scientists is more for those having Ph.D in Statistics than for those who do not.

* Null hypothesis:        $H_0$: $\mu_{PhD}$ $\leq$   $\mu_{NoPhD}$ 
* Alternative hypothesis: $H_A$: $\mu_{PhD}$ >        $\mu_{NoPhD}$ 

Since the rejection region is on the right side of the distribution, it will be a one-tailed test.

### 2 Decide the significance level

* You control the Type I error by choosing $\alpha$, the level of significance: the risk you are willing to take of rejecting the null hypothesis when it is true. Traditionally, you select a level of 0.01, 0.05 or 0.10. The choice depends on the cost of making a Type I error.

* One way to reduce the probability of making a Type II error is by increasing the sample size. For a given level of $\alpha$, increasing the sample size decreases $\beta$ resulting in increasing the power of the statistical test to detect that null hypothesis is false.
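The effect of sample size on $\beta$ can be illustrated for a right-tailed Z test. This is a sketch under assumed values; `type_2_error` is a hypothetical helper, and the means and $\sigma$ are invented.

```python
import math
from scipy import stats

def type_2_error(mu0, mu_true, sigma, n, alpha=0.05):
    """Beta for the right-tailed Z test of H0: mu <= mu0 when the actual mean is mu_true."""
    se = sigma / math.sqrt(n)
    x_crit = mu0 + stats.norm.isf(alpha) * se        # reject H0 when the sample mean exceeds x_crit
    return stats.norm.cdf((x_crit - mu_true) / se)   # P(sample mean <= x_crit | mu = mu_true)

for n in (10, 30, 100, 300):
    beta = type_2_error(mu0=50.0, mu_true=52.0, sigma=8.0, n=n)   # hypothetical values
    print(n, round(beta, 3), round(1 - beta, 3))
```

Each printed row shows n, $\beta$ and the power 1 - $\beta$; with $\alpha$ held fixed, $\beta$ shrinks as n grows.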

### 3 Identify the test statistic

* ### The test statistic will depend on the probability distribution of the sampling distribution

### 4 Calculate the p-value or critical values

* ### P-value is the conditional probability of observing a test statistic value as extreme as, or more extreme than, the sample result when the null hypothesis is true.

* ### Critical value approach

* Critical values for the appropriate test statistic are selected so that the rejection region contains a total area of $\alpha$ when $H_0$ is true and the non-rejection region contains a total area of 1 - $\alpha$ when $H_0$ is true.

### 5 Decide to reject or retain null hypothesis

* ### Reject null hypothesis when test statistic lies in the rejection region; retain null hypothesis otherwise. 
* ### OR
* ### Reject null hypothesis when p-value < α; retain null hypothesis otherwise.
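The five steps can be collected into one function for the case of a Z test with known $\sigma$. This is a minimal sketch: `z_test` and its `alternative` argument are hypothetical names, and the example inputs are invented. It returns the decision from both the critical value approach and the p-value approach so that they can be compared.

```python
import math
from scipy import stats

def z_test(sample_mean, mu0, sigma, n, alpha=0.05, alternative="two-sided"):
    """Return (z, p_value, reject_by_critical_value, reject_by_p_value)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))   # step 3: test statistic
    if alternative == "two-sided":                     # H_A: mu != mu0
        p_value = 2 * stats.norm.sf(abs(z))
        reject_by_critical = abs(z) > stats.norm.isf(alpha / 2)
    elif alternative == "greater":                     # H_A: mu > mu0
        p_value = stats.norm.sf(z)
        reject_by_critical = z > stats.norm.isf(alpha)
    elif alternative == "less":                        # H_A: mu < mu0
        p_value = stats.norm.cdf(z)
        reject_by_critical = z < -stats.norm.isf(alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p_value, reject_by_critical, p_value < alpha   # step 5: decision

# Hypothetical example: sample mean 103 from n = 64, testing H0: mu = 100 with sigma = 12.
z, p, by_crit, by_p = z_test(sample_mean=103.0, mu0=100.0, sigma=12.0, n=64)
print(round(z, 3), round(p, 4), by_crit, by_p)
```

The two approaches always lead to the same decision, because the p-value falls below $\alpha$ exactly when the test statistic falls in the rejection region.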


### EXERCISE

A beverages company produces mineral water, available in 250 ml, 500 ml, 1 litre and 2 litre bottles and in 5 litre, 15 litre and 20 litre jars.
Let us consider 2 litre bottles. The company specification requires a mean volume of 2 litres per bottle.
You must adjust the water filling process when the mean volume in the population of bottles differs from 2 litres. Adjusting the process requires shutting down the water filling production line completely, so you do not want to make any adjustments unnecessarily.

Assume a sample of 50 water bottles indicates a sample mean, $\overline{X}$ of 2.001 litres and the population standard deviation, $\sigma$ is 15 ml.

#### Hypothesis testing using the critical value approach

In [19]:
# 1- Null hypothesis : 𝜇 = 2.00 litres
#    Alternate hypothesis : 𝜇 ≠ 2.00 litres
# 2- alpha = 0.05
# 3- z score
# 4- below
z = (2.001-2.00)/(0.015/((50)**(1/2)))
# 5- check hypothesis
if(z>1.96 or z<-1.96):
    print("Null hypothesis rejected")
else:
    print("Null hypothesis retained")

Null hypothesis retained


### EXERCISE

A manufacturer claims that the mean lifetime of LED lamp is more than 50000 hours. Assume actual mean LED lamp lifetime is 49950 hours and population standard deviation is 120 hours. 

At 5% level of significance, what is the probability of having type II error for a sample size of 30 LED lamps?

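* Assume the actual mean LED lamp lifetime is 49950 hours.
* Treating the claimed lifetime as the null hypothesis ($H_0$: $\mu \geq$ 50000), a Type II error means retaining $H_0$ even though the actual mean is 49950 hours.
* We need to find $\beta$ = P(Retain $H_0$ | $\mu$ = 49950), that is, the probability that the sample mean stays at or above the critical value of the left-tailed test.

In [23]:
# 1- Null hypothesis : 𝜇 ≥ 50000 hours
#    Alternate hypothesis 𝜇 < 50000 hours
# 2- alpha = 0.05
# 3- Probability using z score test
# 4- calculate below
import scipy.stats as st
se     = 120 / ((30)**(1/2))                 # standard error of the sample mean
x_crit = 50000 - st.norm.isf(0.05) * se      # H0 is rejected when the sample mean < x_crit
z      = (x_crit - 49950) / se               # critical value standardised under the actual mean
print("critical sample mean", round(x_crit, 2))
print("p( z ≥ ", round(z, 3), " )")

critical sample mean 49963.96
p( z ≥  0.637  )


In [24]:
beta = 1 - st.norm.cdf(z)   # probability of retaining H0 when the actual mean is 49950
print("Probability of type II error is", round(beta, 3))

Probability of type II error is 0.262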
