## Setting up hypothesis, Setting up of confidence intervals

## Setting up hypothesis

*Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion* - Stephen M Stigler

* A hypothesis is a claim made by a person / organization.

* The claim is usually about population parameters such as the mean or a proportion, and we seek evidence from a sample in support of the claim (Example: the average salary of a Data Scientist with 1 year of experience is Rs 5 Lakhs per annum).

* Hypothesis testing is a process used for either rejecting or retaining the null hypothesis.

**Examples of some claims:**
*  If you drink Horlicks, you can grow taller, stronger and sharper.
*  Two - minute for cooking noodles. (or eating !!)
*  Married people are happier than singles (Anon - 2015).
*  Smokers are better sales people.

*Hypothesis testing is used for checking the validity of the claim using evidence found in sample data.*

### Type I Error, Type II error and power of the hypothesis test

### Type I error

* The conditional probability of rejecting a null hypothesis when it is true is called **Type I error or False Positive.**
* $\alpha$, the level of significance is the value of Type I error.
* P(Reject null hypothesis | $H_0$ is true) = $\alpha$

### Type II error

* The conditional probability of retaining a null hypothesis when it is false is called **Type II error or False Negative.**
* $\beta$ is the value of Type II error.
* P(Retain null hypothesis | $H_0$ is false) = $\beta$

### Power of the test

* (1 - $\beta$) is known as the **power of the test**.
* It is P(Reject null hypothesis | $H_0$ is false) = 1- $\beta$
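
The three quantities above can be computed directly for a two-tailed Z test. The following is a minimal sketch added for illustration: the function name `two_sided_z_power` and the numbers used ($\mu_0$ = 100, true mean 106, $\sigma$ = 15, n = 25) are hypothetical and are not taken from the examples in these notes.

```python
import numpy as np
import scipy.stats as stats

def two_sided_z_power(mu0, mu_true, sigma, n, alpha=0.05):
    """P(reject H0: mu = mu0) for a two-tailed Z test when the true mean is mu_true."""
    z_crit = stats.norm.isf(alpha / 2)                 # upper critical value
    shift  = (mu_true - mu0) / (sigma / np.sqrt(n))    # mean of Z_STAT under mu_true
    # Z_STAT ~ N(shift, 1); H0 is rejected when Z_STAT > z_crit or Z_STAT < -z_crit
    return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)

# Hypothetical values, chosen only for illustration
type_1_error = two_sided_z_power(mu0=100, mu_true=100, sigma=15, n=25)  # H0 is true
power        = two_sided_z_power(mu0=100, mu_true=106, sigma=15, n=25)  # H0 is false
type_2_error = 1 - power

print('Type I error (alpha) %1.4f' % type_1_error)
print('Type II error (beta) %1.4f' % type_2_error)
print('Power (1 - beta)     %1.4f' % power)
```

When the true mean equals the hypothesised mean, the rejection probability returned by the function is simply $\alpha$; as the true mean moves away from $\mu_0$ or n increases, the power rises and $\beta$ falls.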

## One-sample Z test

### Example 1

### A beverages company produces mineral water, available in 250 ml, 500 ml, 1 litre and 2 litre bottles and in 5 litre, 15 litre and 20 litre jars. Let us consider 2 litre bottles. Company specifications require a mean volume of 2 litres per bottle. You must adjust the water filling process when the mean volume in the population of bottles differs from 2 litres. Adjusting the process requires shutting down the water filling production line completely, so you do not want to make any adjustments unnecessarily. Assume a sample of 50 water bottles indicates a sample mean, $\overline{X}$ of 2.001 litres and the population standard deviation, $\sigma$ is 15 ml.

#### Hypothesis testing using the critical value approach

In [3]:
import numpy       as np
import pandas      as pd
import scipy.stats as stats

### Step 1: Define null and alternative hypotheses

In testing whether the mean volume is 2 litres, the null hypothesis states that the mean volume, $\mu$ equals 2 litres. The alternative hypothesis states that the mean volume, $\mu$ is not equal to 2 litres.
* $H_0$: $\mu$ = 2
* $H_A$: $\mu$ $\neq$ 2



### Step 2: Decide the significance level

Choose the $\alpha$, the level of significance according to the relative importance of the risks of committing Type I and Type II errors in the problem. 

In this example, making a Type I error means that you conclude that the population mean is not 2 litres when it is 2 litres. This implies that you will take corrective action on the filling process even though the process is working well (*false alarm*).

On the other hand, when the population mean is 1.98 litres and you conclude that the population mean is 2 litres, you commit a Type II error. Here, you allow the process to continue without adjustment, even though an adjustment is needed (*missed opportunity*).

Here, we select $\alpha$ = 0.05 and n, sample size = 50.

### Step 3:  Identify the test statistic

We know the population standard deviation and the sample is a large sample, n>30. So you use the normal distribution and the $Z_{STAT}$ test statistic.

### Step 4: Calculate the critical value

We know that $\alpha$ is 0.05. So, the critical values of the $Z_{STAT}$ test statistic are -1.96 and 1.96.

In [20]:
print(np.abs(round(stats.norm.isf(q = 0.025),2))) # Here we use alpha by 2  for two-tailed test

1.96


* **Rejection region is $Z_{STAT}$ < -1.96 or $Z_{STAT}$ > 1.96.**
* **Non-rejection region is -1.96 $\leq$ $Z_{STAT}$ $\leq$ 1.96.**

We collect the sample data and calculate the test statistic. 
In our example, 
* $\overline{X}$ = 2.001 litres
* $\mu$   = 2 litres
* $\sigma$ = 15 ml = 0.015 litres
* n       = 50
* $Z_{STAT} = \frac{\overline{X} - \mu} {\frac{\sigma}{\sqrt{n}}}$ 

In [21]:
XAvg  = 2.001   # sample mean in litres
mu    = 2       # hypothesised population mean in litres
sigma = 0.015   # population standard deviation: 15 ml expressed in litres
n     = 50
Z = (XAvg - mu)/(sigma/np.sqrt(n))
print('Value of Z is %2.5f' %Z)

Value of Z is 0.47140


### Step 5: Decide to reject or accept null hypothesis

In this example, Z = 0.4714 lies in the non-rejection region because 
-1.96 < Z = 0.4714 < 1.96.

So the statistical decision is not to reject the null hypothesis.

### So there is not sufficient evidence to conclude that the mean fill is different from 2 litres.
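
The same decision can be reached with the p-value approach instead of comparing against critical values. The sketch below is an addition to the notes and assumes that the standard deviation of 15 ml is expressed in litres (0.015) so that it is in the same unit as the sample mean; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

x_bar = 2.001   # sample mean in litres
mu_0  = 2.0     # hypothesised population mean in litres
sigma = 0.015   # assumption: 15 ml converted to litres
n     = 50
alpha = 0.05

z_stat  = (x_bar - mu_0) / (sigma / np.sqrt(n))
p_value = 2 * stats.norm.sf(abs(z_stat))   # two-tailed p-value

print('Z = %1.4f, p-value = %1.4f' % (z_stat, p_value))
print('Reject H0' if p_value < alpha else 'Do not reject H0')
```

Because the p-value is well above 0.05, the null hypothesis is not rejected, which agrees with the critical value approach.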

### Example 2

### A principal of a prestigious city college claims that the average intelligence of the students of the college is above average. The IQ scores of a random sample of 100 students have a mean of 115. The population mean IQ is 100 with a standard deviation of 15.

**Is there sufficient evidence to support the principal's claim?**

### Solution: Let us work through the several required steps

In [16]:
import numpy       as np
import pandas      as pd
import scipy.stats as stats

### Step 1: Define null and alternative hypotheses

In testing whether the mean IQ of the students is more than 100, the null hypothesis states that the mean IQ, $\mu$ equals 100. The alternative hypothesis states that the mean IQ, $\mu$ is greater than 100.
* $H_0$: $\mu$ = 100
* $H_A$: $\mu$ > 100

### Step 2: Decide the significance level

Here we select $\alpha$ = 0.05 and it is given that n, sample size = 100.

### Step 3: Identify the test statistic

We know the population standard deviation and the sample is a large sample, n>30. So you use the normal distribution and the $Z_{STAT}$ test statistic.

### Step 4: Calculate the critical value and test statistic

In [17]:
Zcrit = round(stats.norm.isf(q = 0.05),2)
print('Value of Z critical is %3.6f' %Zcrit)              

Value of Z critical is 1.640000


We know that $\alpha$ is 0.05. Since this is a one-tailed (upper tail) test, the critical value of the $Z_{STAT}$ test statistic is 1.64.

We collect the sample data, calculate the test statistic. 
In our example, 
* $\overline{X}$ = 115
* $\mu$          = 100
* $\sigma$       = 15
* n              = 100
* $Z_{STAT} = \frac{\overline{X} - \mu} {\frac{\sigma}{\sqrt{n}}}$ 

In [18]:
XAvg  = 115
mu    = 100
sigma = 15
n     = 100
Z = (XAvg - mu)/(sigma/np.sqrt(n))
print('Value of Z is %2.5f' %Z)

Value of Z is 10.00000


### Step 5: Decide to reject or accept null hypothesis

In this example, Z = 10 lies in the rejection region because Z = 10 > 1.64.

So the statistical decision is to reject the null hypothesis.

### So there is sufficient evidence to conclude that the average intelligence of the students of the college is above average.
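
The steps used in Example 1 and Example 2 can be wrapped in a small reusable helper. This is a sketch added for illustration; `one_sample_z_test` and its `alternative` argument are hypothetical names, not a SciPy function.

```python
import numpy as np
import scipy.stats as stats

def one_sample_z_test(x_bar, mu_0, sigma, n, alternative='two-sided'):
    """Return (Z statistic, p-value) for a one-sample Z test with known sigma."""
    z = (x_bar - mu_0) / (sigma / np.sqrt(n))
    if alternative == 'greater':      # H_A: mu > mu_0
        p = stats.norm.sf(z)
    elif alternative == 'less':       # H_A: mu < mu_0
        p = stats.norm.cdf(z)
    else:                             # H_A: mu != mu_0
        p = 2 * stats.norm.sf(abs(z))
    return z, p

# Example 2: is the mean IQ of the college students greater than 100?
z_iq, p_iq = one_sample_z_test(x_bar=115, mu_0=100, sigma=15, n=100, alternative='greater')
print('Z = %2.2f, one-tailed p-value = %1.3e' % (z_iq, p_iq))
```

The p-value for Example 2 is vanishingly small, so the p-value approach also leads to rejecting the null hypothesis at $\alpha$ = 0.05.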


##  In class LAB : Practice Example 1

### In a bank, the average time taken for getting a demand draft or bankers cheque is 15 minutes. From past experience, you can assume that the population is normally distributed with a population standard deviation of 1.6 minutes. You select a sample of 50 requests for demand drafts and the sample mean is 14 minutes. Determine whether there is evidence at a 5% level of significance that the population mean service time to get the demand draft has changed from the population mean of 15 minutes. 

## Confidence Intervals    

* When there is uncertainty around measuring the value of an important population parameter, it is better to find the range in which the value of the parameter is likely to lie rather than predicting a point estimate (single value).
* Confidence interval is the range in which the value of a population parameter is likely to lie with certain probability.
* Confidence interval provides additional information about the population parameter that will be useful in decision making.

### Confidence interval for population mean

Let $X_1$, $X_2$, $X_3$, ..., $X_n$ be the sample means of samples $S_1$, $S_2$, $S_3$, ..., $S_n$ that are drawn from an independent and identically distributed population with mean, $\mu$ and standard deviation, $\sigma$.

From the Central Limit Theorem, we know that the sample means, $X_i$ approximately follow a normal distribution with mean, $\mu$ and standard deviation $\frac{\sigma} {\sqrt{n}}$.

The variable Z = $\frac{X_i - \mu}{\frac{\sigma} {\sqrt{n}}}$ follows a standard normal distribution.

### Assume that we want to find (1 - $\alpha$) 100% confidence interval for the population mean. 
* We can distribute $\alpha$ (probability of not observing true population parameter mean in the interval) equally ($\alpha/2$) on either side of the distribution shown.

* For $\alpha$ = 0.05 or $\alpha/2$ = 0.025, that is 95% confidence interval, we can calculate lower and upper values of the confidence interval from the standard normal distribution.
* scipy.stats.norm.isf(q = 0.025) gives the value of Z for which the area under the standard normal curve to its right (the upper tail) is 0.025.
* The corresponding value is approximately 1.96 as shown in the previous example.
* Using the transformation relationship between standard normal random variable Z and normal random variable X, we can write the 95% confidence interval for population mean when population standard deviation ($\sigma$) is known as:
$\overline{X} \pm 1.96 \frac {\sigma} {\sqrt{n}}$, where $\overline{X}$ is the estimated value of the mean from a sample of size n.

#### In general, the (1 - $\alpha$) 100% confidence interval for the population mean when population standard deviation is known can be written as 

$\overline{X} \pm Z_{\alpha/2} \frac {\sigma} {\sqrt{n}}$

This equation is valid for large sample sizes, irrespective of the distribution of the population.
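
The general formula can be turned into a short function and checked against `scipy.stats.norm.interval`. This is a minimal sketch: the name `z_confidence_interval` and the input values ($\overline{X}$ = 50, $\sigma$ = 10, n = 25) are hypothetical and chosen only to illustrate the calculation.

```python
import numpy as np
import scipy.stats as stats

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """(1 - alpha) 100% confidence interval for the mean when sigma is known."""
    alpha  = 1 - confidence
    z      = stats.norm.isf(alpha / 2)      # Z_{alpha/2}
    margin = z * sigma / np.sqrt(n)         # margin of error
    return x_bar - margin, x_bar + margin

# Hypothetical values for illustration
lower, upper = z_confidence_interval(x_bar=50, sigma=10, n=25, confidence=0.95)

# Cross-check with SciPy: loc is the sample mean, scale is the standard error
check_lower, check_upper = stats.norm.interval(0.95, loc=50, scale=10 / np.sqrt(25))

print('Manual : %1.4f to %1.4f' % (lower, upper))
print('SciPy  : %1.4f to %1.4f' % (check_lower, check_upper))
```

Changing `confidence` to 0.99 widens the interval, because $Z_{\alpha/2}$ grows from about 1.96 to about 2.58.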

## Liquid Volume Example 1 - Find Confidence Interval
John is a quality control analyst in a plant which is required to fill 500 ml of liquid in bottles. Past studies have revealed that the bottles are filled with a standard deviation of 3 ml. John wants to check if the volume of liquid filled in bottles has changed. If the volume has changed then it is required to halt the production and reconfigure the machines. John takes a random sample of 40 bottles and measures the volume of filled liquid. The sample mean is 501.5 ml. Find the confidence interval at the 95% confidence level.

In [19]:
se = 3/(40**0.5)   # standard error of the mean: sigma / sqrt(n)
xbar = 501.5
z = stats.norm.isf(q=0.025)
LCI = xbar - z * se
print("LCI %4.2f" %LCI)
UCI = xbar + z * se
print("UCI %4.2f" %UCI)

LCI 500.57
UCI 502.43
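
The interval also answers John's question about whether the fill volume has changed: a two-tailed Z test at $\alpha$ = 0.05 rejects $H_0$: $\mu$ = 500 exactly when 500 lies outside the 95% confidence interval. The sketch below illustrates this link; treating 500 ml as the hypothesised mean is an interpretation of the problem statement, and the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

mu_0, x_bar, sigma, n = 500, 501.5, 3, 40   # values from the example; mu_0 is the target fill

se     = sigma / np.sqrt(n)                 # standard error of the mean
z_crit = stats.norm.isf(0.025)
lci, uci = x_bar - z_crit * se, x_bar + z_crit * se

z_stat  = (x_bar - mu_0) / se
p_value = 2 * stats.norm.sf(abs(z_stat))    # two-tailed p-value

target_outside_ci = not (lci <= mu_0 <= uci)
print('500 ml outside the 95%% CI: %s' % target_outside_ci)
print('Z = %1.4f, p-value = %1.4f' % (z_stat, p_value))
```

Since 500 ml falls below the lower limit of the interval, the data suggest that the mean fill volume has shifted from the target.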


## Example 2

A sample of 100 diabetic patients was chosen to estimate the length of stay at a local hospital. 
The sample mean was 4.5 days and the population standard deviation was known to be 1.2 days.

* a) Calculate the 95% confidence interval for the population mean.
* b) What is the probability that the population mean is greater than 4.73 days?

### Solution

a) Calculate the 95% confidence interval for the population mean.

It is known that 
* $\overline{X}$ = 4.5
* $\sigma$       = 1.2
* n              = 100
We need to compute $\overline{X} \quad \pm 1.96 \frac {\sigma} {\sqrt{n}}$ 

In [9]:
Xavg  = 4.5 
sigma = 1.2
n     = 100
Lower_Interval = Xavg - (1.96 * (sigma / np.sqrt(n)))
Upper_Interval = Xavg + (1.96 * (sigma / np.sqrt(n)))

print('95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', Lower_Interval , Upper_Interval))

95 % confidence interval for population mean is 4.2648  to 4.7352


b) What is the probability that the population mean is greater than 4.73 days?

We need to do the following:
* a. Calculate the Z value corresponding to 4.73 by subtracting Xavg and dividing by the standard error, s = $\frac{\sigma}{\sqrt{n}}$
* b. Find the probability corresponding to the Z value using scipy.stats.norm.cdf and subtract it from 1, since cdf gives the cumulative probability up to the Z value and we are interested in the probability of exceeding it

In [11]:
s = sigma / np.sqrt(n)   # standard error of the mean (0.12 days)
Z = (4.73 - Xavg) / s

In [12]:
P = 1- stats.norm.cdf(Z)

In [13]:
print('b. Probability that the population mean is greater than 4.73 days %1.4f' % P)

b. Probability that the population mean is greater than 4.73 days 0.0276


### Example 3

Hindustan Pencils Pvt. Ltd. is an Indian manufacturer of pencils, writing materials and other stationery items, established in 1958 in Mumbai. Nataraj brand of pencils manufactured by the company is expected to have a mean length of 172 mm and the standard deviation of the length is 0.02 mm.

To ensure quality, a sample is selected at periodic intervals to determine whether the length is still 172 mm and other dimensions of the pencil meet the quality standards set by the company.

You select a random sample of 100 pencils and the mean is 172 mm. 

Construct a 95% confidence interval for the pencil length.

### Solution

It is known that 
* $\overline{X}$ = 172 mm
* $\sigma$       = 0.02 mm
* n              = 100
We need to compute $\overline{X} \quad \pm 1.96 \frac {\sigma} {\sqrt{n}}$ 

In [14]:
ci     = 0.95
Xavg   = 172
sigma  = 0.02
n      = 100
s      = sigma / np.sqrt(n)   # standard error of the mean
LCI, UCI = stats.norm.interval(ci, loc = Xavg, scale = s) # Give confidence interval 95%, mean and std as arguments to get CI
print('95 %s confidence interval for population mean is %1.4f  to %1.4f' % ('%', LCI , UCI))

95 % confidence interval for population mean is 171.9961  to 172.0039


## In class LAB : Practice Exercise 2

Construct a 99% confidence interval for the following examples given above:
* a. Example 2
* b. Example 3

## One-sample t-test

### Very rarely we know the variance of the population. 

A common strategy to assess a hypothesis is to conduct a t test. A t test can tell whether two groups have the same mean. 
A t test can be performed as:
* 1) One sample t test
* 2) Two sample t test (including paired t test)

We assume that the samples are randomly selected, independent and come from a normally distributed population with unknown but equal variances.
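
One way to see why the t distribution matters for small samples is to compare its critical values with the Z critical value. The sketch below is an illustration added to the notes; the degrees of freedom listed are arbitrary choices.

```python
import scipy.stats as stats

z_crit = stats.norm.isf(0.025)   # two-tailed critical value at alpha = 0.05

# t critical values for increasing degrees of freedom (arbitrary illustrative choices)
t_crit = {df: stats.t.isf(0.025, df) for df in (5, 10, 29, 100, 1000)}

for df, value in t_crit.items():
    print('df = %4d   t critical = %1.4f   (Z critical = %1.4f)' % (df, value, z_crit))
```

The t critical value is always larger than the Z critical value, which reflects the extra uncertainty from estimating the standard deviation, and it approaches the Z value as the degrees of freedom grow.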

## Example 4 : Internet Mobile Time

Experian Marketing Services reported that the typical American spends a mean of 144 minutes (2.4 hours) per day accessing the Internet via a mobile device. (Source: The 2014 Digital Marketer, available at ex.pn/IkXJjfX.) In order to test the validity of this statement, you select a sample of 30 friends and family. The results for the time spent per day accessing the Internet via mobile device (in minutes) are stored in InternetMobileTime 

a. Is there evidence that the population mean time spent per day accessing the Internet via mobile device is different from 144 minutes? Use the p-value approach and a level of significance of 0.05. 

b. What assumption about the population distribution is needed in order to conduct the t test in (a)? 
Problem 9.35 from the Textbook adapted for Classroom Discussion (Chapter 9, page 314) 


In [1]:
import numpy as np
import pandas as pd
import os
import scipy.stats as stats

In [2]:
os.chdir("D:\\SMDM\\")
os.getcwd()
mydata = pd.read_csv('InternetMobileTime.csv')
xbar = mydata.mean()
print("Xbar: ", xbar)
S = mydata.std()
print("S: ", S)


Step 1: Formulate Hypothesis
* $H_0$: $\mu$ = 144
* $H_1$: $\mu$ $\neq$ 144

Step 2: Given: n = 30, degrees of freedom = 30 - 1 = 29, s = 139.84


Step 3: Define test statistic

$t_{STAT} = \frac{\overline{X} - \mu} {\frac{s}{\sqrt{n}}}$ = 1.2246 

Step 4: Draw diagram

Step 5: (critical value approach)
Determine critical values
α = 5% = 0.05. This is two tailed test.

In [103]:
stats.t.isf(0.025,29)

2.0452296421327034

* $t_{\alpha/2}$ = 2.045 
* $-t_{\alpha/2}$ = -2.045 

Step 6: (critical value approach) 
Check whether the tstat value is in the rejection region and make the decision.<br>
Since tstat (1.2246) is in the non-rejection region, H0 is not rejected. 


In [104]:
stats.t.cdf(-1.2246,29) * 2

0.23058088073089986

The two-tailed p-value (0.2306) is greater than α = 0.05. Therefore, there is not sufficient evidence that the population mean time spent per day accessing the Internet via mobile device is different from 144 minutes.

## Another Direct Method

In [105]:
tstat, pvalue = stats.ttest_1samp(mydata,144)   # ttest_1samp lives in scipy.stats, imported above as stats
print("tstat: %3.5f   p-value:  %0.5f" %(tstat, pvalue))

tstat: 1.22467   p-value:  0.23055


## Example 5

Suppose that a doctor claims that 17 year olds have an average body temperature that is higher than the commonly accepted average human temperature of 98.6 degree F.

A simple random statistical sample of 25 people, each of age 17 is selected. 

| ID | Temperature |
| --- | ----- |
| 1 | 98.56 | 
| 2 | 98.66 |
| 3 | 98.54 |
| 4 | 98.71 |
| 5 | 99.22 |
| 6 | 99.49 |
| 7 | 98.14 |
| 8 | 98.84 |
| 9 | 99.28 |
| 10 | 98.48 |
| 11 | 99.88 |
| 12 | 97.29 |
| 13 | 98.88 |
| 14 | 99.07 |
| 15 | 98.81 |
| 16 | 99.49 |
| 17 | 98.57 |
| 18 | 98.98 |
| 19 | 98.75 |
| 20 | 98.69 |
| 21 | 99.28 |
| 22 | 99.52 |
| 23 | 99.22 |
| 24 | 99.01 |
| 25 | 99.02 |

In [106]:
temperature = np.array([98.56, 98.66, 98.54, 98.71, 99.22, 99.49, 98.14, 98.84,\
                         99.28, 98.48, 99.88, 97.29, 98.88, 99.07, 98.81, 99.49,\
                         98.57, 98.98, 98.75, 98.69, 99.28, 99.52, 99.22, 99.01, 99.02])            

In [107]:
print('Mean is %2.1f Sd is %2.2f' % (temperature.mean(),temperature.std()))

Mean is 98.9 Sd is 0.51


### Step 1: Define null and alternative hypotheses

In testing whether 17 year olds have an average body temperature that is higher than 98.6 deg F, the null hypothesis states that the mean body temperature, $\mu$ is at most 98.6. The alternative hypothesis states that the mean body temperature, $\mu$ is greater than 98.6.

* $H_0$: $\mu$ $\leq$ 98.6
* $H_A$: $\mu$ > 98.6

### Step 2: Decide the significance level

Here we select $\alpha$ = 0.05 and it is given that n, sample size = 25.

### Step 3: Identify the test statistic

We do not know the population standard deviation and the sample is not a large sample, n < 30. So you use the t distribution and the $t_{STAT}$ test statistic.

### Step 4: Calculate the p - value and test statistic

**scipy.stats.ttest_1samp calculates the t test for the mean of one sample given the sample observations and  the expected value in the null hypothesis. This function returns t statistic and two-tailed p value.**

In [108]:
t_statistic, p_value = stats.ttest_1samp(temperature, 98.6)

In [112]:
print(t_statistic, p_value)

2.8431039481 0.00898088361729


In [113]:
print("p-value for one-tail:", p_value/2)

p-value for one-tail: 0.00449044180865


### Step 5: Decide to reject or accept null hypothesis

In this example, the one-tailed p value is 0.0045 and it is less than the 5% level of significance.

So the statistical decision is to reject the null hypothesis at 5% level of significance.

### So there is sufficient evidence to conclude that 17 year olds have an average body temperature that is higher than the commonly accepted average human temperature of 98.6 degree F.
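
The one-tailed result can also be obtained without halving the two-tailed p value. The sketch below computes the t statistic by hand and then uses the `alternative='greater'` argument of `scipy.stats.ttest_1samp`; this assumes SciPy 1.6 or later, where that argument is available.

```python
import numpy as np
import scipy.stats as stats

temperature = np.array([98.56, 98.66, 98.54, 98.71, 99.22, 99.49, 98.14, 98.84,
                        99.28, 98.48, 99.88, 97.29, 98.88, 99.07, 98.81, 99.49,
                        98.57, 98.98, 98.75, 98.69, 99.28, 99.52, 99.22, 99.01, 99.02])

n = temperature.size
s = temperature.std(ddof=1)     # sample standard deviation (divides by n - 1)

# Manual calculation of the t statistic and the upper-tail p-value
t_manual = (temperature.mean() - 98.6) / (s / np.sqrt(n))
p_manual = stats.t.sf(t_manual, df=n - 1)

# Same test done by SciPy (assumes SciPy >= 1.6 for the alternative argument)
t_scipy, p_scipy = stats.ttest_1samp(temperature, 98.6, alternative='greater')

print('Manual: t = %1.4f, one-tailed p = %1.5f' % (t_manual, p_manual))
print('SciPy : t = %1.4f, one-tailed p = %1.5f' % (t_scipy, p_scipy))
```

Note that `ddof=1` is needed for the sample standard deviation; the default `temperature.std()` divides by n rather than n - 1.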

## In class lab : Practice Exercise 3

You are given the daily sugar intake of 11 diabetic patients in the following Python code. 

**Is there any evidence against the claim that the average daily sugar intake of the diabetic patients is 7600 mg?** 

**Hint: Use t test**

In [115]:
# daily intake of Sugar in milligrams for 11 diabetic patients
import numpy as np
daily_intake = np.array([5560, 5770, 7640, 5180, 5690, 6435, 6803, 7689, 6876, 8213, 8765])

## Two sample test (t-test)

**Two sample t test (Snedecor and Cochran 1989) is used to determine if two population means are equal.
A common application is to test if a new treatment or approach or process is yielding better results than the current treatment or approach or process.**

* 1) Data is *paired* - For example, a group of students are given coaching classes and effect of coaching on the  marks scored is determined.
* 2) Data is *not paired* - For example, find out  whether the miles per gallon of  cars of Japanese make is superior to cars of Indian make.

## 1) Independent Two sample t-test - Pooled Variance Test

## Example 6 - Luggage

A hotel manager looks to enhance the initial impressions that hotel guests have when they check in. Contributing to initial impressions is the time it takes to deliver a guest’s luggage to the room after check-in. A random sample of 20 deliveries on a particular day was selected in Wing A of the hotel, and a random sample of 20 deliveries was selected in Wing B. The results are stored in Luggage. Analyze the data and determine whether there is a difference between the mean delivery times in the two wings of the hotel. (Use alpha = 0.05) <br>
Problem 10.83 from the Textbook adapted for Classroom Discussion (Chapter 10, page 387)

In [116]:
import numpy as np
import pandas as pd
import os
import scipy.stats as stats

In [93]:
os.chdir("D:\\SMDM\\")
os.getcwd()
mydata = pd.read_csv('Luggage.csv')
mydata.head()


Unnamed: 0,WingA,WingB
0,10.7,7.2
1,9.89,6.68
2,11.83,9.29
3,9.04,8.95
4,9.37,6.61


In [94]:
t_statistic, p_value  =  stats.ttest_ind(mydata.WingA,mydata.WingB)
print('tstat  %1.3f' % t_statistic)    
print('P Value %1.5f' % p_value)    

tstat  5.162
P Value 0.00001


Since p-value < α, we reject the null hypothesis. <br>
<b>There is statistically significant evidence of a difference between the mean delivery times in the two wings of the hotel.</b>

## Example 7 - Weight Loss

Compare two unrelated samples. Data was collected on the weight loss of 16 women and 20 men enrolled in a weight reduction program.
At $\alpha$ = 0.05, test whether the weight loss of these two samples is different.

In [117]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [85]:
Weight_loss_Male   = [ 3.69, 4.12, 4.65, 3.19,  4.34, 3.68, 4.12, 4.50, 3.70, 3.09,3.65, 4.73, 3.93, 3.46, 3.28, 4.43, 4.13, 3.62, 3.71, 2.92]
Weight_loss_Female = [2.99, 1.80, 3.79, 4.12, 1.76, 3.50, 3.61, 2.32, 3.67, 4.26, 4.57, 3.01, 3.82, 4.33, 3.40, 3.86]

### Step 1: Define null and alternative hypotheses

In testing whether the weight reduction of females and males is the same, the null hypothesis states that the mean weight reductions are equal, $\mu_M$ = $\mu_F$. The alternative hypothesis states that the weight reduction is different for males and females, $\mu_M$ $\neq$ $\mu_F$.

* $H_0$: $\mu_M$ - $\mu_F$ =      0
* $H_A$: $\mu_M$ - $\mu_F$ $\neq$  0

### Step 2: Decide the significance level

Here we select $\alpha$ = 0.05 and sample size < 30 and population standard deviation is not known.

### Step 3: Identify the test statistic

* We have two samples and we do not know the population standard deviation.
* Sample sizes for both samples are not same.
* The sample is not a large sample, n < 30. So you use the t distribution and the $t_{STAT}$ test statistic for two sample unpaired test.

### Step 4: Calculate the p - value and test statistic

**We use scipy.stats.ttest_ind to calculate the t-test for the means of TWO INDEPENDENT samples of scores given the two sample observations. This function returns the t statistic and the two-tailed p value.**

**This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. By default, this test assumes that the populations have identical variances.**

In [120]:
t_statistic, p_value  =  stats.ttest_ind(Weight_loss_Male,Weight_loss_Female)
print('tstat  %1.3f' % t_statistic)    
print('P Value %1.3f' % p_value)    

tstat  1.827
P Value 0.076


### Step 5:  Decide to reject or accept null hypothesis

In this example, p value is 0.076 and it is more than 5% level of significance.

So the statistical decision is not to reject the null hypothesis at 5% level of significance.

### So there is not sufficient evidence to reject the null hypothesis that the weight loss of these men and women is the same.
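
The pooled variance test relies on the two populations having equal variances. When that assumption is doubtful, `scipy.stats.ttest_ind` can be run with `equal_var=False`, which performs Welch's t test instead. The sketch below is an addition to the notes that compares the two versions on the weight loss data; it is not part of the original solution.

```python
import numpy as np
import scipy.stats as stats

Weight_loss_Male   = [3.69, 4.12, 4.65, 3.19, 4.34, 3.68, 4.12, 4.50, 3.70, 3.09,
                      3.65, 4.73, 3.93, 3.46, 3.28, 4.43, 4.13, 3.62, 3.71, 2.92]
Weight_loss_Female = [2.99, 1.80, 3.79, 4.12, 1.76, 3.50, 3.61, 2.32,
                      3.67, 4.26, 4.57, 3.01, 3.82, 4.33, 3.40, 3.86]

# Sample variances (ddof=1) give a rough idea of whether equal variances is plausible
var_male   = np.var(Weight_loss_Male, ddof=1)
var_female = np.var(Weight_loss_Female, ddof=1)
print('Sample variance: male %1.3f, female %1.3f' % (var_male, var_female))

# Pooled variance t test (the default, equal_var=True)
t_pooled, p_pooled = stats.ttest_ind(Weight_loss_Male, Weight_loss_Female)

# Welch's t test, which does not assume equal population variances
t_welch, p_welch = stats.ttest_ind(Weight_loss_Male, Weight_loss_Female, equal_var=False)

print('Pooled: tstat %1.3f  P Value %1.3f' % (t_pooled, p_pooled))
print('Welch : tstat %1.3f  P Value %1.3f' % (t_welch, p_welch))
```

For these data both versions give a p value above 0.05, so the decision not to reject the null hypothesis does not depend on the equal variance assumption.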

##  In class lab : Practice Exercise 4

Compare the following two unrelated samples. Data was collected on the weight of women and men enrolled in a weight reduction program.
At $\alpha$ = 0.05, test whether the weight of these two samples is different.

In [121]:
Weight_Female       =  [ 53.8, 54.4, 51.2, 52.5, 61.0, 50.6, 51.6, 70.0]
Weight_Male         =  [ 72.5, 80.3, 71.3, 67.7, 66.2, 73.4, 61.3, 76.8]

## 2) Two sample t test for paired data - Paired t test

## Example 8 - Concrete

The file Concrete1 contains the compressive strength, in thousands of pounds per square inch (psi), of 40 samples of concrete taken two and seven days after pouring. (Data extracted from O. Carrillo-Gamboa and R. F. Gunst, “Measurement-Error-Model Collinearities,” Technometrics, 34 (1992): 454–464.)

At the 0.01 level of significance, is there evidence that the mean strength is lower at two days than at seven days?

Problem 10.26 from the Textbook adapted for Classroom Discussion (Chapter 10, page 353)


In [124]:
import numpy as np
import pandas as pd
import os
import scipy.stats as stats

In [125]:
os.chdir("D:\\SMDM\\")
os.getcwd()
mydata = pd.read_csv('Concrete.csv')
mydata.head()


Unnamed: 0,Sample,TwoDays,SevenDays
0,1,2.83,3.505
1,2,3.295,3.43
2,3,2.71,3.67
3,4,2.855,3.355
4,5,2.98,3.985


**We use scipy.stats.ttest_rel to calculate the t-test on TWO RELATED samples of scores.
This is a two-sided test for the null hypothesis that 2 related or repeated samples have identical average (expected) values. Here we give the two sample observations as input. This function returns the t statistic and the two-tailed p value.**

In [129]:
t_statistic, p_value  =  stats.ttest_rel(mydata.TwoDays,mydata.SevenDays)
print('tstat  %1.3f' % t_statistic)    
print('P Value %1.7f' % p_value)    

tstat  -9.372
P Value 0.0000000


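Since the t statistic is negative and the p-value (even before halving it for the one-tailed test) is far below α = 0.01, we reject the null hypothesis. <br>
<b>At the 0.01 level of significance, there is significant evidence that the mean strength is lower at two days than at seven days.</b>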

## Example 9 - Marks

Compare two related samples. Data was collected on the marks scored by 25 students in their final practice exam and the marks scored by the students after attending special coaching classes conducted by their college.
At 5% level of significance, is there any evidence that the coaching classes have any effect on the marks scored?

In [131]:
import numpy as np
import pandas as pd
import os
import scipy.stats as stats

In [130]:
Marks_before = [ 52, 56, 61, 47, 58, 52, 56, 60, 52, 46, 51, 62, 54, 50, 48, 59, 56, 51, 52, 44, 52, 45, 57, 60, 45]

Marks_after  = [62, 64, 40, 65, 76, 82, 53, 68, 77, 60, 69, 34, 69, 73, 67, 82, 62, 49, 44, 43, 77, 61, 67, 67, 54]

## Step 1: Define null and alternative hypotheses

In testing whether coaching has any effect on marks scored, the null hypothesis states that the mean difference in marks is 0, that is, $\mu_{After}$ equals $\mu_{Before}$. The alternative hypothesis states that the mean difference in marks is not 0, $\mu_{After}$ $\neq$ $\mu_{Before}$.

* $H_0$: $\mu_{After}$ - $\mu_{Before}$ =  0
* $H_A$: $\mu_{After}$ - $\mu_{Before}$ $\neq$  0

### Step 2: Decide the significance level

Here we select $\alpha$ = 0.05 and sample size < 30 and population standard deviation is not known.

### Step 3: Identify the test statistic

* Sample sizes for both samples are  same.
* We have two paired samples and we do not know the population standard deviation.
* The sample is not a large sample, n < 30. So you use the t distribution and the $t_{STAT}$ test statistic for two sample paired test.

### Step 4: Calculate the p - value and test statistic

**We use scipy.stats.ttest_rel to calculate the t-test on TWO RELATED samples of scores.
This is a two-sided test for the null hypothesis that 2 related or repeated samples have identical average (expected) values. Here we give the two sample observations as input. This function returns the t statistic and the two-tailed p value.**

In [132]:
t_statistic, p_value  =  stats.ttest_rel(Marks_after, Marks_before )
print('P Value %1.3f' % p_value)  

P Value 0.002


### Step 5:  Decide to reject or accept null hypothesis

In this example, p value is 0.002 and it is less than 5% level of significance.

So the statistical decision is to reject the null hypothesis at 5% level of significance.

### So there is sufficient evidence to conclude that the coaching classes have an effect on the marks scored by students.
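
A paired t test is equivalent to a one-sample t test on the differences between the pairs, tested against a mean difference of 0. The sketch below illustrates this equivalence on the marks data; the variable names `differences`, `t_paired` and `t_diff` are illustrative additions.

```python
import numpy as np
import scipy.stats as stats

Marks_before = [52, 56, 61, 47, 58, 52, 56, 60, 52, 46, 51, 62, 54, 50, 48, 59, 56,
                51, 52, 44, 52, 45, 57, 60, 45]
Marks_after  = [62, 64, 40, 65, 76, 82, 53, 68, 77, 60, 69, 34, 69, 73, 67, 82, 62,
                49, 44, 43, 77, 61, 67, 67, 54]

# Difference in marks for each student (after - before)
differences = np.array(Marks_after) - np.array(Marks_before)

# Paired t test on the two related samples
t_paired, p_paired = stats.ttest_rel(Marks_after, Marks_before)

# One-sample t test on the differences against a hypothesised mean of 0
t_diff, p_diff = stats.ttest_1samp(differences, 0)

print('Mean difference %1.2f' % differences.mean())
print('Paired    : tstat %1.3f  P Value %1.3f' % (t_paired, p_paired))
print('One-sample: tstat %1.3f  P Value %1.3f' % (t_diff, p_diff))
```

Both calls return the same t statistic and p value, which shows that pairing reduces the two-sample problem to a single sample of differences.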

##  In class lab : Practice Exercise 5

Here the weights of 25 people were recorded before they had a new therapy and then again 6 months later. 
Check if new therapy leads to a change in weight.

In [134]:
wt_before = [76, 76, 72, 73, 64, 63, 75, 75, 71, 76, 71, 76, 78, 73, 76, 70, 71, 82, 84, 68, 70, 68, 66, 67, 74]
wt_after  = [63, 72, 67, 69, 58, 59, 70, 71, 70, 71, 68, 71, 72, 69, 72, 67, 67, 78, 79, 62, 67, 63, 61, 63, 69]

At 5% level of significance, is there any evidence that the new therapy has any effect on the weight of the participants?

Hint: Use a paired t test

### Take home exercises

**1 Example: The following data represent the amount of soft drink filled in a sample of 50 consecutive 2-liter bottles as shown below:**

|     _  |    _   |     _  |      _ |      _ |     _  |      _ |     _  |     _  |    _   |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 2.109 | 2.086 | 2.066 | 2.075 | 2.065 | 2.057 | 2.052 | 2.044 | 2.036 | 2.038 | 
| 2.031 | 2.029 | 2.025 | 2.029 | 2.023 | 2.020 | 2.015 | 2.014 | 2.013 | 2.014 | 
| 2.012 | 2.012 | 2.012 | 2.010 | 2.005 | 2.003 | 1.999 | 1.996 | 1.997 | 1.992 | 
| 1.994 | 1.986 | 1.984 | 1.981 | 1.973 | 1.975 | 1.971 | 1.969 | 1.966 | 1.967 | 
| 1.963 | 1.957 | 1.951 | 1.951 | 1.947 | 1.941 | 1.941 | 1.938 | 1.908 | 1.894 | 

At 5% level of significance, is there evidence that the mean amount of soft drink filled is different from 2 litres? 


#### Hint: Use the following piece of code and try t test for one sample

In [137]:
import numpy       as np
import scipy.stats as stats
volume = np.array([2.109, 2.086, 2.066, 2.075, 2.065, 2.057, 2.052, 2.044, 2.036, 2.038, \
                   2.031, 2.029, 2.025, 2.029, 2.023, 2.020, 2.015, 2.014, 2.013, 2.014,\
                   2.012, 2.012, 2.012, 2.010, 2.005, 2.003, 1.999, 1.996, 1.997, 1.992,\
                   1.994, 1.986, 1.984, 1.981, 1.973, 1.975, 1.971, 1.969, 1.966, 1.967,\
                   1.963, 1.957, 1.951, 1.951, 1.947, 1.941, 1.941, 1.938, 1.908, 1.894])

print('Mean is %3.2f and standard deviation is %3.2f' %(volume.mean(),np.std(volume,ddof = 1)))

Mean is 2.00 and standard deviation is 0.04


**2. Sugar consumption in grams of 22 patients (both diabetic and non-diabetic) is given below:**

*At 5% level of significance, is there evidence that the mean sugar consumption is different for diabetic and non-diabetic patients? In the following array, 0 means diabetic and 1 means non-diabetic.*
    

In [139]:
import numpy       as np
import scipy.stats as stats
weight               = np.array([[9.31, 0],[7.76, 0],[6.98, 1],[7.88, 1],[8.49, 1],[10.05, 1],[8.80, 1],[10.88, 1],[6.13, 1],[7.90, 1], \
                            [11.51, 0],[12.59, 0],[7.05, 1],[11.85, 0],[9.99, 0],[7.48, 0],[8.79, 0],[8.69, 1],[9.68, 0],[8.58, 1],\
                           [9.19, 0],[8.11, 1]])

sugar_diabetic       = weight[:,1] == 0
sugar_diabetic       = weight[sugar_diabetic][:,0]
sugar_nondiabetic    = weight[:,1] == 1
sugar_nondiabetic    = weight[sugar_nondiabetic][:,0] 

#### Hint: 

Use the numpy array, sugar_diabetic and numpy array, sugar_nondiabetic for your analysis.

**3. The delivery times of pizza from an online food delivery service firm and the home delivery times from a local restaurant are given below. At 5% level of significance, is the mean delivery time for the online food delivery service firm less than the mean delivery time for home delivery from the local restaurant?**

In [140]:
Pizza_delivery_online = [16.8, 11.7, 15.6, 16.7, 17.5, 18.1, 14.1, 21.8, 13.9, 20.8]
Pizza_delivery_local  = [22.0, 15.2, 18.7, 15.6, 20.8, 19.5, 17.0, 19.5, 16.5, 24.0]

#### Hint: Use paired t test

## END