# Inferential Statistics

# Introduction to hypothesis testing

*Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion* - Stephen M Stigler

* A hypothesis is a claim made by a person / organization.

* The claim is usually about population parameters such as the mean or a proportion, and we seek evidence from a sample to support the claim (Example: the average salary of a Data Scientist with 1 year of experience is Rs 5 Lakhs per annum).

* Hypothesis testing is a process used for either rejecting or retaining the null hypothesis.

**Examples of some claims:**
*  If you drink Horlicks, you can grow taller, stronger and sharper.
*  Noodles take two minutes to cook (or to eat!).
*  Married people are happier than singles (Anon - 2015).
*  Smokers are better sales people.

*Hypothesis testing is used for checking the validity of the claim using evidence found in sample data.*

### Type I Error, Type II error and power of the hypothesis test

### Type I error:

* The conditional probability of rejecting a null hypothesis when it is true is called **Type I error or False Positive.**
* $\alpha$, the level of significance, is the probability of a Type I error.
* P(Reject null hypothesis | $H_0$ is true) = $\alpha$

### Type II error:

* The conditional probability of retaining a null hypothesis when it is false is called **Type II error or False Negative.**
* $\beta$ is the probability of a Type II error.
* P(Retain null hypothesis | $H_0$ is false) = $\beta$

### Power of the test

* (1 - $\beta$) is known as the **power of the test**.
* It is P(Reject null hypothesis | $H_0$ is false) = 1- $\beta$
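The three probabilities above can be estimated by simulation. The following is a minimal sketch, not part of the original material: it assumes a two-tailed Z test with hypothetical values (`mu0 = 0`, `sigma = 1`, `n = 30`, a true mean of 0.5 when $H_0$ is false), and the helper name `rejection_rate` is made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical setup (not from the text): two-tailed Z test of H0: mu = 0
# with known sigma = 1, n = 30 observations per sample, alpha = 0.05.
mu0, sigma, n, alpha = 0.0, 1.0, 30, 0.05
z_crit = stats.norm.isf(alpha / 2)
n_sims = 20_000

def rejection_rate(true_mu):
    """Fraction of simulated samples whose Z statistic falls in the rejection region."""
    samples = rng.normal(loc=true_mu, scale=sigma, size=(n_sims, n))
    z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) > z_crit)

type1_rate = rejection_rate(true_mu=0.0)   # H0 true: estimates alpha
power = rejection_rate(true_mu=0.5)        # H0 false: estimates 1 - beta
type2_rate = 1 - power                     # estimates beta

print('Estimated Type I error rate : %.3f' % type1_rate)
print('Estimated Type II error rate: %.3f' % type2_rate)
print('Estimated power             : %.3f' % power)
```

When $H_0$ is true, the rejection rate should land close to the chosen $\alpha$. When $H_0$ is false, the rejection rate is the power, and the remainder is $\beta$.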

## Steps involved in solving the hypothesis testing

### 1 Define null and alternative hypotheses

* ### Null hypothesis means no relationship or status quo
* ### Alternative hypothesis is what the researcher wants to prove

### Example:

Write the null and alternative hypotheses from the following hypothesis descriptions:
a. Average annual salary of Data Scientists is different for those having Ph.D in Statistics and those who do not.
* Let $\mu_{PhD}$ be the average annual salary of a Data scientist with Ph.D in Statistics.
* Let $\mu_{NoPhD}$ be the average annual salary of a Data scientist without Ph.D in Statistics.

* Null hypothesis:        $H_0$: $\mu_{PhD}$ =    $\mu_{NoPhD}$ 
* Alternative hypothesis: $H_A$: $\mu_{PhD}$ $\neq$ $\mu_{NoPhD}$ 

Since the rejection region is on either side of the distribution, it will be a **two-tailed** test.

b. Average annual salary of Data Scientists is more for those having Ph.D in Statistics than for those who do not.

* Null hypothesis:        $H_0$: $\mu_{PhD}$ $\leq$   $\mu_{NoPhD}$ 
* Alternative hypothesis: $H_A$: $\mu_{PhD}$ >        $\mu_{NoPhD}$ 

Since the rejection region is on the right side of the distribution, it will be a one-tailed test.

### 2 Decide the significance level

* You control the Type I error by choosing $\alpha$, the level of significance, which is the risk you are willing to take of rejecting the null hypothesis when it is true. Traditionally, you select a level of 0.01, 0.05 or 0.10. The choice depends on the cost of making a Type I error.

* One way to reduce the probability of making a Type II error is by increasing the sample size. For a given level of $\alpha$, increasing the sample size decreases $\beta$ resulting in increasing the power of the statistical test to detect that null hypothesis is false.
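The effect of sample size on $\beta$ can be checked numerically. This is a rough sketch for a two-tailed Z test with known $\sigma$; the function name `z_test_power` and the values used (null mean 100, true mean 102, $\sigma$ = 10) are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

def z_test_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of a two-tailed Z test when the true mean is mu_true."""
    z_crit = stats.norm.isf(alpha / 2)
    shift = (mu_true - mu0) / (sigma / np.sqrt(n))   # standardized distance from H0
    # P(reject H0) = P(Z > z_crit - shift) + P(Z < -z_crit - shift)
    return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)

# Hypothetical values: H0 mean 100, true mean 102, sigma 10.
for n in (10, 25, 50, 100, 200):
    power = z_test_power(mu0=100, mu_true=102, sigma=10, n=n)
    print('n = %3d  beta = %.3f  power = %.3f' % (n, 1 - power, power))
```

With $\alpha$ held at 0.05, the printed $\beta$ falls and the power rises as n grows. When the true mean equals the null mean, the same function returns $\alpha$.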

### 3 Identify the test statistic

* ### The test statistic depends on the sampling distribution of the sample statistic

### 4 Calculate the p-value or critical values

* ### P-value is the conditional probability of observing a test statistic value as extreme as, or more extreme than, the sample result when the null hypothesis is true.

* ### Critical value approach

* Critical values for the appropriate test statistic are selected so that the rejection region contains a total area of $\alpha$ when $H_0$ is true and the non-rejection region contains a total area of 1 - $\alpha$ when $H_0$ is true.
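Both quantities in this step can be computed with `scipy.stats.norm`. The sketch below assumes a Z test statistic and uses a made-up observed value, `z_obs = 1.75`, to contrast a two-tailed test with a right-tailed one.

```python
from scipy import stats

alpha = 0.05
z_obs = 1.75   # hypothetical observed Z statistic, for illustration only

# Critical values: the rejection region has total area alpha under H0.
z_crit_two = stats.norm.isf(alpha / 2)   # two-tailed: alpha/2 in each tail
z_crit_right = stats.norm.isf(alpha)     # right-tailed: all of alpha in the upper tail

# P-values: probability, under H0, of a statistic at least as extreme as z_obs.
p_two = 2 * stats.norm.sf(abs(z_obs))
p_right = stats.norm.sf(z_obs)

print('Two-tailed  : critical values +/-%.3f, p-value %.4f' % (z_crit_two, p_two))
print('Right-tailed: critical value     %.3f, p-value %.4f' % (z_crit_right, p_right))
```

For this value of `z_obs`, the two-tailed test retains $H_0$ while the right-tailed test rejects it, which shows why the form of the alternative hypothesis has to be fixed before the data are examined.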

### 5 Decide to reject or retain null hypothesis

* ### Reject null hypothesis when test statistic lies in the rejection region; retain null hypothesis otherwise. 
* ### OR
* ### Reject null hypothesis when p-value < α; retain null hypothesis otherwise.


## Hypothesis testing using the critical value approach

### Step 1: Define null and alternative hypotheses

In testing whether the mean volume is 2 litres, the null hypothesis states that the mean volume, $\mu$, equals 2 litres. The alternative hypothesis states that the mean volume, $\mu$, is not equal to 2 litres.
* $H_0$: $\mu$ = 2
* $H_A$: $\mu$ $\neq$ 2



### Step 2: Decide the significance level

Choose $\alpha$, the level of significance, according to the relative importance of the risks of committing Type I and Type II errors in the problem. 

In this example, making a Type I error means that you conclude that the population mean is not 2 litres when it is 2 litres. This implies that you will take corrective action on the filling process even though the process is working well (*false alarm*).

On the other hand, when the population mean is 1.98 litres and you conclude that the population mean is 2 litres, you commit a Type II error. Here, you allow the process to continue without adjustment, even though an adjustment is needed (*missed opportunity*).

Here, we select $\alpha$ = 0.05 and n, sample size = 50.

### Step 3:  Identify the test statistic

We know the population standard deviation and the sample is large (n > 30), so you use the normal distribution and the $Z_{STAT}$ test statistic.

### Step 4: Calculate the critical value

We know that $\alpha$ is 0.05. So, the critical values of the $Z_{STAT}$ test statistic are -1.96 and 1.96.

In [1]:
import numpy       as np
import pandas      as pd
import scipy.stats as stats
print(np.abs(round(stats.norm.isf(q = 0.025), 2))) # alpha / 2 = 0.025 in each tail for a two-tailed test

1.96


* ### Rejection region is $Z_{STAT}$ < -1.96 or $Z_{STAT}$ > 1.96
* ### Acceptance or non-rejection region is -1.96 $\leq$ $Z_{STAT}$ $\leq$ 1.96

We collect the sample data and calculate the test statistic. 
In our example, 
* $\overline{X}$ = 2.001
* $\mu$   = 2
* $\sigma$ = 15
* n       = 50
* $Z_{STAT} = \frac{\overline{X} - \mu} {\frac{\sigma}{\sqrt{n}}}$ 

In [14]:
XAvg  = 2.001
mu    = 2
sigma = 15
n     = 50
Z = (XAvg - mu)/(sigma/np.sqrt(n))
print('Value of Z observed is %2.5f' %Z)

Value of Z observed is 0.00047


### Step 5: Decide to reject or retain null hypothesis

In this example, Z = 0.00047 (Z observed) lies in the non-rejection region because 
-1.96 < Z = 0.00047 < 1.96.

The absolute value of Z observed is less than Z critical (1.96).

So the statistical decision is not to reject the null hypothesis.

### So there is not sufficient evidence to conclude that the mean fill is different from 2 litres.
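The five steps of this example can be collected into one small function. This is only a sketch: the name `two_tailed_z_test` is hypothetical, and the p-value line applies the alternative decision rule from the list of steps (reject when p-value < α) to the same sample values used in the example.

```python
import numpy as np
from scipy import stats

def two_tailed_z_test(x_bar, mu0, sigma, n, alpha=0.05):
    """Two-tailed Z test of H0: mu = mu0 with known population sigma."""
    z = (x_bar - mu0) / (sigma / np.sqrt(n))
    z_crit = stats.norm.isf(alpha / 2)
    p_value = 2 * stats.norm.sf(abs(z))
    reject = abs(z) > z_crit        # equivalent to p_value < alpha
    return z, z_crit, p_value, reject

# Sample values from the filling example above.
z, z_crit, p_value, reject = two_tailed_z_test(x_bar=2.001, mu0=2, sigma=15, n=50)
print('Z = %.5f, critical values = +/-%.2f, p-value = %.4f' % (z, z_crit, p_value))
print('Decision:', 'reject H0' if reject else 'retain H0')
```

Both rules give the same decision here: the observed Z is well inside the non-rejection region and the p-value is far above 0.05, so the null hypothesis is retained.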