# Faculty Notebook Day2

In [1]:
import numpy       as np
import pandas      as pd
import scipy.stats as stats


# Introduction to hypothesis testing

*Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion* - Stephen M Stigler

* A hypothesis is a claim made by a person or organization.

* The claim is usually about a population parameter such as the mean or a proportion, and we seek evidence from a sample to support it (example: the average salary of a Data Scientist with 1 year of experience is Rs 5 Lakhs per annum).

* Hypothesis testing is the process used to either reject or retain the null hypothesis.

**Examples of some claims:**
*  If you drink Horlicks, you can grow taller, stronger and sharper.
*  Noodles take two minutes to cook (or to eat!).
*  Married people are happier than singles (Anon - 2015).
*  Smokers are better salespeople.

*Hypothesis testing is used for checking the validity of the claim using evidence found in sample data.*

### Type I Error, Type II error and power of the hypothesis test

### Type I error:

* Rejecting a null hypothesis when it is true is called a **Type I error or false positive**.
* $\alpha$, the level of significance, is the conditional probability of a Type I error.
* P(Reject null hypothesis | $H_0$ is true) = $\alpha$

### Type II error:

* Retaining a null hypothesis when it is false is called a **Type II error or false negative**.
* $\beta$ is the conditional probability of a Type II error.
* P(Retain null hypothesis | $H_0$ is false) = $\beta$

### Power of the test

* (1 - $\beta$) is known as the **power of the test**.
* It is P(Reject null hypothesis | $H_0$ is false) = 1- $\beta$
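The three definitions above can be checked with a small simulation. The sketch below is illustrative only: the values of `alpha`, `n`, `sigma`, `mu_0` and `mu_alt` are assumptions, not taken from this notebook. It repeatedly draws samples, applies a two-tailed z test, and counts how often the null hypothesis is rejected when it is true (an estimate of $\alpha$) and when it is false (an estimate of the power, 1 - $\beta$).

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable

alpha, n, sigma = 0.05, 50, 1.0         # illustrative values, not from the notebook
mu_0 = 0.0                              # mean under H0
mu_alt = 0.5                            # hypothetical true mean when H0 is false
z_crit = stats.norm.isf(alpha / 2)      # two-tailed critical value

def rejection_rate(true_mu, n_trials=20_000):
    """Share of simulated samples whose z statistic falls in the rejection region."""
    samples = rng.normal(true_mu, sigma, size=(n_trials, n))
    z = (samples.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) > z_crit)

type_1_rate = rejection_rate(mu_0)      # H0 true: estimates alpha
power = rejection_rate(mu_alt)          # H0 false: estimates 1 - beta
beta = 1 - power

print(f"Estimated Type I error rate: {type_1_rate:.3f}")
print(f"Estimated power: {power:.3f}, estimated beta: {beta:.3f}")
```

The estimated Type I error rate should sit close to the chosen $\alpha$, while the power depends on how far `mu_alt` is from `mu_0`.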

## Steps involved in solving the hypothesis testing

### 1 Define null and alternative hypotheses

* ### Null hypothesis means no relationship or status quo
* ### Alternative hypothesis is what the researcher wants to prove

### Example:

Write the null and alternative hypotheses from the following hypothesis descriptions:

a. Average annual salary of Data Scientists is different for those having Ph.D in Statistics and those who do not.
* Let $\mu_{PhD}$ be the average annual salary of a Data scientist with Ph.D in Statistics.
* Let $\mu_{NoPhD}$ be the average annual salary of a Data scientist without Ph.D in Statistics.

* Null hypothesis:        $H_0$: $\mu_{PhD}$ =    $\mu_{NoPhD}$ 
* Alternative hypothesis: $H_A$: $\mu_{PhD}$ $\neq$ $\mu_{NoPhD}$ 

Since the rejection region is on either side of the distribution, it will be a **two-tailed** test.

b. Average annual salary of Data Scientists is higher for those having Ph.D in Statistics than for those who do not.

* Null hypothesis:        $H_0$: $\mu_{PhD}$ $\leq$   $\mu_{NoPhD}$ 
* Alternative hypothesis: $H_A$: $\mu_{PhD}$ >        $\mu_{NoPhD}$ 

Since the rejection region is on the right side of the distribution, it will be a **one-tailed** (right-tailed) test.

### 2 Decide the significance level

* You control the Type I error by choosing $\alpha$, the level of significance: the risk you are willing to take of rejecting the null hypothesis when it is true. Traditionally, you select a level of 0.01, 0.05 or 0.10. The choice depends on the cost of making a Type I error.

* One way to reduce the probability of making a Type II error is by increasing the sample size. For a given level of $\alpha$, increasing the sample size decreases $\beta$ resulting in increasing the power of the statistical test to detect that null hypothesis is false.
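The effect of sample size on $\beta$ can be computed directly for a two-tailed z test. The sketch below is a minimal illustration: the function name `z_test_power`, the effect size of 0.2 and $\sigma$ = 1 are all assumptions made for the example.

```python
import numpy as np
import scipy.stats as stats

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-tailed z test when the true mean differs from H0 by `effect`."""
    z_crit = stats.norm.isf(alpha / 2)
    shift = effect / (sigma / np.sqrt(n))     # true mean shift in standard-error units
    # Probability that the z statistic lands in either rejection tail
    return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)

# Hypothetical effect and sigma, chosen only for illustration
powers = {n: z_test_power(effect=0.2, sigma=1.0, n=n) for n in (25, 50, 100, 200)}
for n, power in powers.items():
    print(f"n = {n:3d}: power = {power:.3f}, beta = {1 - power:.3f}")
```

For a fixed $\alpha$ and a fixed true effect, the printed power rises and $\beta$ falls as n grows. With an effect of zero the function returns $\alpha$ itself, since every rejection is then a Type I error.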

### 3 Identify the test statistic

* ### The test statistic depends on the sampling distribution of the sample statistic

### 4 Calculate the p-value or critical values

* ### P-value is the conditional probability of observing a test statistic value as extreme as, or more extreme than, the sample result when the null hypothesis is true.

* ### Critical value approach

* Critical values for the appropriate test statistic are selected so that the rejection region contains a total area of $\alpha$ when $H_0$ is true and the non-rejection region contains a total area of 1 - $\alpha$ when $H_0$ is true.
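As a quick illustration of the critical value approach, the sketch below lists the standard normal critical values for the three traditional significance levels, for both a two-tailed and a right-tailed test. The dictionary name `critical_values` is only a convenience for this example.

```python
import scipy.stats as stats

critical_values = {}
for alpha in (0.01, 0.05, 0.10):
    two_tailed = stats.norm.isf(alpha / 2)   # reject H0 when |Z| exceeds this value
    right_tailed = stats.norm.isf(alpha)     # reject H0 when Z exceeds this value
    critical_values[alpha] = (two_tailed, right_tailed)
    print(f"alpha = {alpha:.2f}: two-tailed +/-{two_tailed:.3f}, right-tailed {right_tailed:.3f}")
```

A two-tailed test splits $\alpha$ across both tails, so its critical value is further from zero than the one-tailed value at the same $\alpha$.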

### 5 Decide to reject or retain the null hypothesis

* ### Reject null hypothesis when test statistic lies in the rejection region; retain null hypothesis otherwise. 
* ### OR
* ### Reject null hypothesis when p-value < α; retain null hypothesis otherwise.


#### Hypothesis testing using the critical value approach

### Step 1: Define null and alternative hypotheses

In testing whether the mean volume is 2 litres, the null hypothesis states that the mean volume, $\mu$, equals 2 litres. The alternative hypothesis states that the mean volume, $\mu$, is not equal to 2 litres.
* $H_0$: $\mu$ = 2
* $H_A$: $\mu$ $\neq$ 2



### Step 2: Decide the significance level

Choose the $\alpha$, the level of significance according to the relative importance of the risks of committing Type I and Type II errors in the problem. 

In this example, making a Type I error means that you conclude that the population mean is not 2 litres when it is 2 litres. This implies that you will take corrective action on the filling process even though the process is working well (*false alarm*).

On the other hand, when the population mean is 1.98 litres and you conclude that the population mean is 2 litres, you commit a Type II error. Here, you allow the process to continue without adjustment, even though an adjustment is needed (*missed opportunity*).

Here, we select $\alpha$ = 0.05 and n, sample size = 50.

### Step 3:  Identify the test statistic

We know the population standard deviation and the sample is large (n > 30), so we use the normal distribution and the $Z_{STAT}$ test statistic.

### Step 4: Calculate the critical value

We know that $\alpha$ is 0.05. So the critical values of the $Z_{STAT}$ test statistic are -1.96 and 1.96.

In [6]:
print(np.abs(round(stats.norm.isf(q = 0.025),2))) # Here we use alpha by 2  for two-tailed test

1.96


* ### Rejection region is $Z_{STAT}$ < -1.96 or $Z_{STAT}$ > 1.96
* ### Non-rejection region is -1.96 $\leq$ $Z_{STAT}$ $\leq$ 1.96

We collect the sample data, calculate the test statistic. 
In our example, 
* $\overline{X}$ = 2.001
* $\mu$   = 2
* $\sigma$ = 15
* n       = 50
* $Z_{STAT} = \frac{\overline{X} - \mu} {\frac{\sigma}{\sqrt{n}}}$ 

In [0]:
XAvg  = 2.001
mu    = 2
sigma = 15
n     = 50
Z = (XAvg - mu)/(sigma/np.sqrt(n))
print('Value of Z observed is %2.5f' %Z)

Value of Z observed is 0.00047


### Step 5: Decide to reject or retain the null hypothesis

In this example, the observed Z = 0.00047 lies in the non-rejection region because
-1.96 < Z = 0.00047 < 1.96.

The absolute value of the observed Z is less than the critical value of 1.96.

So the statistical decision is not to reject the null hypothesis.

### So there is insufficient evidence to conclude that the mean fill is different from 2 litres.
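The same decision can be reached with the p-value approach. The sketch below reuses the values from this example and computes the two-tailed p-value from the standard normal distribution; the variable names are chosen for this illustration.

```python
import numpy as np
import scipy.stats as stats

x_avg, mu, sigma, n = 2.001, 2, 15, 50       # values from the example above
alpha = 0.05

z = (x_avg - mu) / (sigma / np.sqrt(n))
p_value = 2 * stats.norm.sf(abs(z))          # two-tailed: area in both tails beyond |z|

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"Z = {z:.5f}, p-value = {p_value:.4f}, decision: {decision}")
```

Because the observed Z is so close to zero, the p-value is close to 1, and both approaches lead to retaining the null hypothesis.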

## One sample test

COURIER COMPANY EXAMPLE

A courier company claims that their mean delivery time to any part of New York City is less than 3 hours.
Generate samples using the randn function with a mean delivery time of 2.8 hours and a standard deviation of 0.6 hours.

In [14]:
samples=np.random.randn(200)
s_std=(samples*0.6)+2.8
np.mean(s_std)

2.7902318555083845

In [15]:
np.std(s_std)

0.5804134511282852

Perform a one-sample t test to check whether the courier company's claim holds.

In [17]:
from scipy.stats import ttest_1samp
ttest_1samp(s_std,3)

Ttest_1sampResult(statistic=-5.098337789269435, pvalue=7.951796770402608e-07)

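The p-value is the probability of seeing a test statistic at least this extreme if Ho were true; it is not the probability that Ho is true. Here it is 7.9e-07, far below the 0.05 significance level, so we strongly reject Ho (mean delivery time >= 3 hrs) in favour of Ha, the company's claim (mean delivery time < 3 hrs). Note that ttest_1samp reports a two-sided p-value; because the t statistic is negative, the one-sided p-value for Ha is half of it, which is smaller still. Hence we conclude that the data support the courier company's claim.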
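Since the company's claim is one-sided, a left-tailed p-value is the more direct measure. The sketch below computes it by hand from the t distribution and compares it with the two-sided result. The seed and the variable names are assumptions for this illustration, so the numbers will differ from the unseeded run above. Recent SciPy releases (1.6 and later) also accept `alternative='less'` in `ttest_1samp`; the manual route is used here so the sketch does not depend on the installed version.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)                    # seed is an assumption, for repeatability
delivery_times = rng.normal(loc=2.8, scale=0.6, size=200)

mu_0 = 3.0                                         # boundary value of H0: mean >= 3 hours
n = delivery_times.size
t_stat = (delivery_times.mean() - mu_0) / (delivery_times.std(ddof=1) / np.sqrt(n))

p_left = stats.t.cdf(t_stat, df=n - 1)             # left-tailed p-value for Ha: mean < 3

two_sided = stats.ttest_1samp(delivery_times, mu_0)
print(f"t = {t_stat:.3f}, one-sided p = {p_left:.3g}, two-sided p = {two_sided.pvalue:.3g}")
```

When the t statistic is negative, the left-tailed p-value is exactly half of the two-sided one.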

SOYABEAN YIELD EXAMPLE:5.5 from 

Texas A&M Agriculture Agency claims that its mean soybean yield has increased over the previous year to more than 520 kg per acre. Set up all the parts of a statistical test for the soybean example and use the sample
data to reach a decision on whether to reject or retain the null hypothesis.

Generate samples using randn with a mean yield of 573 kg/acre and an SD of 124.

In [48]:
samples2=np.random.randn(200)
s2=(samples2*124)+573
np.mean(s2)

574.8676651952599

In [49]:
np.std(s2)

124.7612176137469

Set up the hypotheses:
* Ho: Mean yield <= 520 kg/acre
* Ha: Mean yield > 520 kg/acre

In [50]:
ttest_1samp(s2,520)

Ttest_1sampResult(statistic=6.203880352679003, pvalue=3.134004248686627e-09)

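The p-value (3.13e-09) is far below 0.05, so data like these would be very unlikely if Ho were true; hence we reject Ho. The t statistic is positive, so the one-sided p-value for Ha is half the reported two-sided value. The data support the agency's claim.
[For a 5% significance level, the p-value must be less than 0.05 to reject Ho.]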

LIVE CLASS DATA EXAMPLE (INFER THE AVERAGE AGE OF DSE-B-JULY-A BATCH)USING ONE_SAMPLE_t-TEST

In [0]:
#Let us assume entire class of 64 is population
age=np.array([22.25,24.5,24.75,25.75,21.75,25.5,28.7,23.25,27.5,24.5,23.6,24.75,27.75,22.25,22.75,22.9,24.75,25.75,27.5,25.1,25.0,22.25,22.5,25.1,22.5,24.1,21.75,22.5,26.0,24.6,24.75,21.6,24.25,27.5,24.25,22.75,22.5,25.75,26.3,24.75,24,22.75,23.08,24.75,24.25,23.8,23.1,24.5,24.75,25.6,24.3,26.2,26,30.25,22.0,23.7,26,26,26.4,23.4,23.9,21.5,26.9,23.4])

In [55]:
np.mean(age)

24.4809375

In [0]:
#Let us randomly pick 30 samples to infer the age of population
sample_age=np.array([22.25,24.6,22.5,23.9,21.5,26.9,24.1,21.75,22.5,26.0,22.5,25.1,25.75,26.3,24.75,24.5,24.75,25.75,21.75,23.08,24.75,24.25,22.5,27.75,22.25,24.25,23.8,23.4,23.9,23.25])

In [53]:
ttest_1samp(sample_age,28) #28yrs is not the representation of class (population)

Ttest_1sampResult(statistic=-13.599172166964355, pvalue=4.093251281298374e-14)

In [54]:
ttest_1samp(sample_age,24) #24yrs is a plausible value for the class mean (true mean of population is 24.48yrs)

Ttest_1sampResult(statistic=0.03750085079882189, pvalue=0.9703426245918079)

P-value > 0.05, so we fail to reject Ho (mean age = 24 yrs).
Hence 24 yrs is a plausible value for the class mean, and the sample is consistent with the population (class), whose true mean is 24.48 yrs.
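To see where the statistic comes from, the sketch below recomputes the one-sample t statistic by hand as $t = \frac{\overline{X} - \mu_0}{s / \sqrt{n}}$ and compares it with `ttest_1samp`. The names `mu_0`, `t_manual` and `p_manual` are chosen for this illustration. Note that the sample standard deviation needs `ddof=1`, whereas `np.std` defaults to `ddof=0`.

```python
import numpy as np
import scipy.stats as stats

sample_age = np.array([22.25, 24.6, 22.5, 23.9, 21.5, 26.9, 24.1, 21.75, 22.5, 26.0,
                       22.5, 25.1, 25.75, 26.3, 24.75, 24.5, 24.75, 25.75, 21.75, 23.08,
                       24.75, 24.25, 22.5, 27.75, 22.25, 24.25, 23.8, 23.4, 23.9, 23.25])

mu_0 = 24                                  # hypothesised mean age under Ho
n = sample_age.size
x_bar = sample_age.mean()
s = sample_age.std(ddof=1)                 # sample standard deviation (n - 1 denominator)

t_manual = (x_bar - mu_0) / (s / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-tailed p-value, n - 1 degrees of freedom

result = stats.ttest_1samp(sample_age, mu_0)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy : t = {result.statistic:.4f}, p = {result.pvalue:.4f}")
```

Both routes give the same statistic and p-value, which confirms that `ttest_1samp` uses the t distribution with n - 1 degrees of freedom.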

### Very rarely we know the variance of the population. 

A common strategy to assess a hypothesis is to conduct a t test. A t test can tell whether a sample mean differs from a hypothesised value, or whether two groups have the same mean. 
A t test can be conducted as:
* 1) One sample t test
* 2) Two sample t test (including paired t test)

We assume that the samples are randomly selected, independent and come from a normally distributed population with unknown but equal variances.

### One sample t test

In [0]:
from scipy.stats             import ttest_1samp,ttest_ind, wilcoxon
from statsmodels.stats.power import ttest_power
import matplotlib.pyplot     as     plt

## Two sample t test for paired data

### Example 3

Compare two related samples. Data was collected on the marks scored by 25 students in their final practice exam and the marks scored by the same students after attending special coaching classes conducted by their college.
At 5% level of significance, is there any evidence that the coaching classes have any effect on the marks scored?

In [0]:
Marks_before = np.array([ 52, 56, 61, 47, 58, 52, 56, 60, 52, 46, 51, 62, 54, 50, 48, 59, 56, 51, 52, 44, 52, 45, 57, 60, 45])

Marks_after  = np.array([62, 64, 40, 65, 76, 82, 53, 68, 77, 60, 69, 34, 69, 73, 67, 82, 62, 49, 44, 43, 77, 61, 67, 67, 54])

## Step 1: Define null and alternative hypotheses

In testing whether coaching has any effect on marks scored, the null hypothesis states that the mean marks after coaching, $\mu_{After}$, equal the mean marks before coaching, $\mu_{Before}$. The alternative hypothesis states that the difference in mean marks is not zero, $\mu_{After}$ $\neq$ $\mu_{Before}$.

* $H_0$: $\mu_{After}$ - $\mu_{Before}$ =  0
* $H_A$: $\mu_{After}$ - $\mu_{Before}$ $\neq$  0

### Step 2: Decide the significance level

Here we select $\alpha$ = 0.05. The sample size is below 30 and the population standard deviation is not known.

### Step 3: Identify the test statistic

* Sample sizes for both samples are the same.
* We have two paired samples and we do not know the population standard deviation.
* The sample is not a large sample, n < 30. So you use the t distribution and the $t_{STAT}$ test statistic for two sample paired test.

### Step 4: Calculate the p - value and test statistic

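**A paired t test can be run with scipy.stats.ttest_rel, which calculates the t test on TWO RELATED samples of scores.
This is a two-sided test for the null hypothesis that 2 related or repeated samples have identical average (expected) values. It is equivalent to a one-sample t test of the pairwise differences against 0, which is what the call to ttest_1samp below does. The differences are computed as before minus after, so a negative t statistic means the marks went up. The function returns the t statistic and the two-tailed p value.**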

In [62]:
ttest_1samp(Marks_before-Marks_after,0)

Ttest_1sampResult(statistic=-3.404831324883169, pvalue=0.0023297583680290364)
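The equivalence between the paired t test and a one-sample t test on the differences can be verified directly. The sketch below runs `scipy.stats.ttest_rel` on the two arrays and `ttest_1samp` on after minus before; the result names `paired` and `on_diff` are chosen for this illustration.

```python
import numpy as np
import scipy.stats as stats

Marks_before = np.array([52, 56, 61, 47, 58, 52, 56, 60, 52, 46, 51, 62, 54,
                         50, 48, 59, 56, 51, 52, 44, 52, 45, 57, 60, 45])
Marks_after = np.array([62, 64, 40, 65, 76, 82, 53, 68, 77, 60, 69, 34, 69,
                        73, 67, 82, 62, 49, 44, 43, 77, 61, 67, 67, 54])

# Paired t test on the two related samples
paired = stats.ttest_rel(Marks_after, Marks_before)

# One-sample t test on the pairwise differences (after minus before) against 0
on_diff = stats.ttest_1samp(Marks_after - Marks_before, 0)

print(f"ttest_rel  : t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"ttest_1samp: t = {on_diff.statistic:.4f}, p = {on_diff.pvalue:.4f}")
```

Swapping the order of subtraction only flips the sign of the t statistic; the two-tailed p-value is unchanged.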

### Step 5:  Decide to reject or retain the null hypothesis

In this example, p value is 0.002 and it is less than 5% level of significance

So the statistical decision is to reject the null hypothesis at 5% level of significance.

### So there is sufficient evidence to reject the null hypothesis of no effect: the coaching classes do appear to affect the marks scored by students.

### Example 4
**Alcohol consumption before and after love failure is given in the arrays below. Conduct a paired t test to check whether the alcohol consumption is more after the love failure at 5% level of significance.**

## Step 1: Define null and alternative hypotheses

In testing whether the breakup has any effect on alcohol consumption, the null hypothesis states that the difference in mean alcohol consumption, $\mu_{After}$ - $\mu_{Before}$, is zero. The alternative hypothesis, as tested here, is two-sided: the difference in mean alcohol consumption, $\mu_{After}$ - $\mu_{Before}$, is not zero.

* $H_0$: $\mu_{After}$ - $\mu_{Before}$ =  0
* $H_A$: $\mu_{After}$ - $\mu_{Before}$ $\neq$  0

### Step 2: Decide the significance level

Here we select $\alpha$ = 0.05. The sample size is below 30 and the population standard deviation is not known.

### Step 3: Identify the test statistic

* Sample sizes for both samples are the same.
* We have two paired samples and we do not know the population standard deviation.
* The sample is not a large sample, n < 30. So you use the t distribution and the $t_{STAT}$ test statistic for two sample paired test.

### Step 4: Calculate the p - value and test statistic

**We use scipy.stats.ttest_1samp to calculate the t test on the difference between sample scores.**

In [0]:
import numpy as np

Alchohol_Consumption_before = np.array([470, 354, 496, 351, 349, 449, 378, 359, 469, 329, 389, 497, 493, 268, 445, 287, 338, 271, 412, 335])
Alchohol_Consumption_after  = np.array([408, 439, 321, 437, 335, 344, 318, 492, 531, 417, 358, 391, 398, 394, 508, 399, 345, 341, 326, 467])



In [64]:
import  scipy.stats  as stats  
t_statistic, p_value  =  stats.ttest_1samp(Alchohol_Consumption_before-Alchohol_Consumption_after, 0)
print('P Value %1.3f' % p_value)  

P Value 0.597


### Step 5:  Decide to reject or retain the null hypothesis

In this example, the p value is 0.597, which is more than the 5% level of significance.

So the statistical decision is to retain (fail to reject) the null hypothesis at 5% level of significance.

### There is not sufficient evidence to reject the null hypothesis. So we retain the null hypothesis: the data do not show an effect of love failure on alcohol consumption.
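Because the question asks whether consumption is higher after the love failure, a right-tailed version of the same test is also worth a look. The sketch below is one way to compute it by hand from the t distribution, using after minus before as the difference; the names `diff`, `p_two_sided` and `p_greater` are chosen for this illustration.

```python
import numpy as np
import scipy.stats as stats

Alchohol_Consumption_before = np.array([470, 354, 496, 351, 349, 449, 378, 359, 469, 329,
                                        389, 497, 493, 268, 445, 287, 338, 271, 412, 335])
Alchohol_Consumption_after = np.array([408, 439, 321, 437, 335, 344, 318, 492, 531, 417,
                                       358, 391, 398, 394, 508, 399, 345, 341, 326, 467])

diff = Alchohol_Consumption_after - Alchohol_Consumption_before
n = diff.size
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))   # one-sample t statistic against 0

p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)      # Ha: mean difference != 0
p_greater = stats.t.sf(t_stat, df=n - 1)                 # Ha: mean difference > 0

print(f"mean difference = {diff.mean():.1f}, t = {t_stat:.3f}")
print(f"two-sided p = {p_two_sided:.3f}, right-tailed p = {p_greater:.3f}")
```

The mean difference is positive, so the right-tailed p-value is half the two-sided one. It is still well above 0.05, so the conclusion does not change.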

**Two sample t test for independent data**

Compare the following two unrelated samples. Data was collected on the weight of women and men enrolled in a weight reduction program. At $\alpha$ = 0.05, test whether the mean weight of these two samples is different.

In [0]:
Weight_Female       =  [ 53.8, 54.4, 51.2, 52.5, 61.0, 50.6, 51.6, 70.0]
Weight_Male         =  [ 72.5, 80.3, 71.3, 67.7, 66.2, 73.4, 61.3, 76.8]

In [66]:
from scipy.stats import ttest_ind
stats.ttest_ind(Weight_Female,Weight_Male)

Ttest_indResult(statistic=-4.886344172533444, pvalue=0.00024034957515992796)

P-value < 0.05, so we reject Ho (mean weight of females = mean weight of males) in favour of Ha: there is a significant difference in the mean weights of males and females.
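By default `ttest_ind` pools the two sample variances, which relies on the equal-variance assumption stated earlier. As a hedged check, the sketch below reruns the test with `equal_var=False` (Welch's t test), which drops that assumption; the result names `pooled` and `welch` are chosen for this illustration.

```python
import scipy.stats as stats

Weight_Female = [53.8, 54.4, 51.2, 52.5, 61.0, 50.6, 51.6, 70.0]
Weight_Male = [72.5, 80.3, 71.3, 67.7, 66.2, 73.4, 61.3, 76.8]

# Default: pooled-variance t test (assumes equal population variances)
pooled = stats.ttest_ind(Weight_Female, Weight_Male)

# Welch's t test: does not assume equal population variances
welch = stats.ttest_ind(Weight_Female, Weight_Male, equal_var=False)

print(f"pooled: t = {pooled.statistic:.3f}, p = {pooled.pvalue:.5f}")
print(f"Welch : t = {welch.statistic:.3f}, p = {welch.pvalue:.5f}")
```

With equal sample sizes the two t statistics coincide and only the degrees of freedom differ, so the p-values are close. Here both versions reject Ho at the 5% level.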

## End