<a id="lib"></a>
#  Import Libraries

In [3]:
import pandas as pd
import numpy as np
import scipy.stats as stats

<a id="defn"></a>
# Test of Hypothesis

Hypothesis testing is the process of evaluating the validity of claims made about the population using the sample data obtained from the population. A statistical test is a rule used to decide whether to reject or retain the claim.

**Examples of hypothesis:**

        1. One can get 'A' grade if the attendance in the class is more than 75%.
        2. A probiotic drink can improve the immunity of a person. 

<a id="types"></a>
## Types of Hypothesis

`Null Hypothesis`: The null hypothesis is the statistical hypothesis suggesting 'no difference' between the population parameter and a specific value. It is denoted as H<sub>0</sub>.

`Alternative Hypothesis`: It is the hypothesis that is tested against the null hypothesis and states the existence of a difference between the parameter and a specific value. It is denoted by H<sub>a</sub> or H<sub>1</sub>.

The claim is usually the alternative hypothesis H<sub>1</sub>, also known as the research hypothesis. To test the claim, we gather evidence (data) and find the likelihood of the data under the assumption that H<sub>0</sub> is true.

A company that produces tennis balls claimed that the diameter of a tennis ball is at least 2.625 inches on average. On the other hand, a professional tennis coach claimed that the diameter of a ball is less than what the company has claimed. To test the claim of the coach, a statistical test can be performed considering the hypothesis:

                    Null Hypothesis: Average diameter ≥ 2.625
                    Alternative Hypothesis: Average diameter < 2.625

The mean diastolic blood pressure for a group of 85 adults is less than 90 mm Hg.

Question:

Write the null and alternative hypotheses for the following scenarios:
1. A bolt manufacturing company claims that the average number of bolts it manufactures in a day is 60.
2. The new virus testing kit takes 20 hours less than the usual testing kit, which takes 34 hours to produce results.
3. An analyst wants to test whether Apple Inc. outperformed by 13.5% in 2019.


<a id="test_type"></a>
# Types of Test

The hypothesis test is used to validate the claims about the population. The types of tests are based on the nature of the alternative hypothesis. 

<a id="2tailed"></a>
## Two Tailed Test

A two-tailed test considers whether the value of the population parameter is either less than or greater than (i.e. not equal to) a specific value. <br>
If we test the population mean ($\mu$) with a specific value ($\mu_{0}$) the null hypothesis is: $H_{0}: \mu = \mu_{0}$. 

The alternative hypothesis for the two tailed test is given as: $H_{1}: \mu \neq \mu_{0}$

#### Example:

A company that produces tennis balls claimed that the diameter of a tennis ball is 2.625 inches on average. To test the company's claim, a statistical test can be performed considering the hypothesis:

                    Null Hypothesis: Average diameter = 2.625
                    Alternative Hypothesis: Average diameter ≠ 2.625

<a id="1tailed"></a>
## One Tailed Test

A one-tailed test considers whether the value of the population parameter is less than or greater than (but not both) a specific value. <br>
If we test the population mean ($\mu$) against a specific value ($\mu_{0}$) with the null hypothesis $H_{0}: \mu \leq \mu_{0}$ and the alternative hypothesis $H_{1}: \mu > \mu_{0}$, the one-tailed test is known as a `right-tailed test`.

If the null hypothesis is $H_{0}: \mu \geq \mu_{0}$ and the alternative hypothesis is $H_{1}: \mu < \mu_{0}$, the one-tailed test is known as a `left-tailed test`.


### Example:

**1.** The company's annual quality report of machines states that a lathe machine works efficiently at most for 8 months on average after the servicing. The production manager claims that after the special tuxan servicing, the machine works efficiently for more than 8 months. To test the claim of production manager consider the hypothesis:

                    Null Hypothesis: Machine efficiency ≤ 8 months
                    Alternative Hypothesis: Machine efficiency > 8 months

This is the example of a **right-tailed test**. 

**2.** A railway authority claims that all the trains on the Chicago-Seattle route run with a speed of at least 54 mph on average. A customer forum declares that there are various records from passengers claiming that the speed of the train is less than what railway has claimed. In this scenario, a statistical test can be performed to test the claim of customer forum considering the hypothesis:

                    Null Hypothesis: Speed ≥ 54 mph
                    Alternative Hypothesis: Speed < 54 mph

This is the example of a **left-tailed test**. 
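
The difference between the three kinds of test shows up in where the rejection region sits. The sketch below assumes a test statistic that follows a standard normal distribution and a significance level of 0.05; both are choices made here purely for illustration.

```python
import scipy.stats as stats

alpha = 0.05  # illustrative significance level (an assumption, not from the examples above)

# Two-tailed test: alpha is split equally between the two tails
two_tailed = (stats.norm.ppf(alpha / 2), stats.norm.ppf(1 - alpha / 2))

# Right-tailed test: the whole of alpha sits in the upper tail
right_tailed = stats.norm.ppf(1 - alpha)

# Left-tailed test: the whole of alpha sits in the lower tail
left_tailed = stats.norm.ppf(alpha)

print('Two-tailed critical values :', two_tailed)
print('Right-tailed critical value:', right_tailed)
print('Left-tailed critical value :', left_tailed)
```

For the same significance level, a one-tailed critical value is closer to zero than the two-tailed ones, because the whole rejection probability is placed on one side.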

<a id="eg"></a>
#  Hypothesis Tests with Z Statistic

Let us perform one sample Z test for the population mean. We compare the population mean with a specific value. The sample is assumed to be taken from a population following a normal distribution.

To check the normality of the data, a test for normality is used. The `Shapiro-Wilk Test` is one of the methods used to check the normality. The hypothesis of the test is given as:
<p style='text-indent:25em'> <strong> H<sub>0</sub>:  The data is normally distributed </strong> </p>
<p style='text-indent:25em'> <strong> H<sub>1</sub>:  The data is not normally distributed </strong> </p>

The `shapiro()` function from the scipy library performs a Shapiro-Wilk normality test. 
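
A minimal sketch of how `shapiro()` might be used is given below. The sample is simulated, and the seed, mean, standard deviation and sample size are hypothetical values chosen only for the demonstration.

```python
import numpy as np
import scipy.stats as stats

# Simulated sample (hypothetical values, used only to demonstrate the call)
rng = np.random.default_rng(seed=10)
sample = rng.normal(loc=50, scale=5, size=40)

# shapiro() returns the test statistic W and the p-value
stat, p_value = stats.shapiro(sample)
print('Test statistic:', stat)
print('p-value:', p_value)

# Ho: the data is normally distributed
if p_value < 0.05:
    print('Reject Ho: the data does not appear to be normally distributed')
else:
    print('Fail to reject Ho: no evidence against normality')
```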

The null and alternative hypothesis of Z-test is given as:
<p style='text-indent:25em'> <strong> $H_{0}: \mu = \mu_{0}$ or $\mu \geq \mu_{0}$ or $\mu \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu \neq \mu_{0}$ or $\mu < \mu_{0}$ or $\mu > \mu_{0}$</strong></p>

Consider a normal population with standard deviation $\sigma$. Let us take a sample of size $n$, such that $n > 30$. 
The test statistic for one sample Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{\overline{X} -  \mu}{\frac{\sigma}{\sqrt{n}}}$</strong></p>

Where, <br>
$\overline{X}$: Sample mean<br>
$\mu$: Specified mean<br>
$\sigma$: Population standard deviation<br>
$n$: Sample size

Under $H_{0}$ the test statistic follows a standard normal distribution.

If $\sigma$ is unknown, use the sample standard deviation (s) instead of $\sigma$ to calculate the test statistic.
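
The test statistic and the matching p-value can be wrapped in a small helper. The function name `one_sample_z_test`, its `alternative` argument and the numbers in the usage line are hypothetical, introduced here only as a sketch of the formula above.

```python
import numpy as np
import scipy.stats as stats

def one_sample_z_test(x_bar, mu_0, sigma, n, alternative='two-sided'):
    """Hypothetical helper: return the z statistic and p-value of a one sample Z-test."""
    se = sigma / np.sqrt(n)          # standard error of the sample mean
    z = (x_bar - mu_0) / se          # test statistic
    if alternative == 'two-sided':   # H1: mu != mu_0
        p_value = 2 * stats.norm.sf(abs(z))
    elif alternative == 'greater':   # H1: mu > mu_0 (right-tailed)
        p_value = stats.norm.sf(z)
    elif alternative == 'less':      # H1: mu < mu_0 (left-tailed)
        p_value = stats.norm.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p_value

# Illustrative numbers only: sample mean 52, specified mean 50, sigma 6, n 36
z, p = one_sample_z_test(x_bar=52, mu_0=50, sigma=6, n=36)
print('z-value:', z)
print('p-value:', p)
```

With this convention the p-value is always compared directly with $\alpha$, whichever tail the alternative hypothesis points to.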


#### 1. A car manufacturing company claims that the mileage of their new car is 25 kmph with a standard deviation of 2.5 kmph. A random sample of 45 cars was drawn and their mileage was recorded as per the standard procedure. From the sample, the mean mileage was seen to be 24 kmph. Is this evidence to claim that the mean mileage is different from 25 kmph? (assume the normality of the data) Use α = 0.01.

The null and alternative hypothesis is:

H<sub>0</sub>: $\mu = 25 $<br>
H<sub>1</sub>: $\mu ≠ 25 $

Here ⍺ = 0.01, for a two-tailed test calculate the critical z-value.

In [4]:
stats.norm.interval(confidence=0.99)

(-2.5758293035489004, 2.5758293035489004)

In [5]:
mu=25
x_bar=24
sigma=2.5
n=45
se=sigma/(np.sqrt(n))

z1=(x_bar-mu)/se
print('z-value (Test Statistic)',z1)

z-value (Test Statistic) -2.6832815729997477


In [6]:
# Same thing can be done using p value

### P-Value

In [7]:
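# Lower-tail probability P(sample mean <= x_bar) under Ho;
# for a two-tailed test it is compared with alpha/2 and 1 - alpha/2
p_value=stats.norm.cdf(x_bar,loc=mu,scale=se)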
p_value

0.003645179045767819

In [8]:
stats.norm.isf(0.10)

1.2815515655446004

In [9]:
if p_value<0.005 or p_value > 0.995:
    print('Reject Ho:')
else:
    print('Fail to reject Ho:')

Reject Ho:


In [10]:
alpha=0.01
# Two-tailed test: reject Ho when the lower-tail probability falls in either tail
if p_value<(alpha/2) or p_value > (1-alpha/2):
    print('Reject Ho:')
else:
    print('Fail to reject Ho:')

Reject Ho:
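
The quantity computed with `stats.norm.cdf` in this problem is a lower-tail probability. As an alternative sketch, a two-tailed p-value can be obtained directly by doubling the tail area beyond |z| and comparing it with α itself. The variable name `two_sided_p` is introduced here for illustration.

```python
import numpy as np
import scipy.stats as stats

mu, x_bar, sigma, n, alpha = 25, 24, 2.5, 45, 0.01

se = sigma / np.sqrt(n)          # standard error of the sample mean
z = (x_bar - mu) / se            # test statistic

# Two-tailed p-value: probability of a statistic at least as extreme as |z|
two_sided_p = 2 * stats.norm.sf(abs(z))
print('Two-tailed p-value:', two_sided_p)

if two_sided_p < alpha:
    print('Reject Ho: the mean mileage differs from 25')
else:
    print('Fail to reject Ho')
```

The two-tailed p-value is twice the lower-tail probability shown earlier (about 0.0073), so the decision at α = 0.01 is still to reject Ho.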


#### 2. The average calories in a slice of bread of the brand 'Alphas' are 82 with a standard deviation of 15. An experiment is conducted to test the claim of the dietitians that the calories in a slice of bread are not as per the manufacturer's specification. A sample of 40 slices of bread is taken and the mean calories recorded are 95. Test the claim of dietitians with ⍺ value (significance level) as 0.05. (assume the normality of the data).

In [11]:
# Ho: mu<=82
# Ha: mu>82

In [12]:
stats.norm.interval(confidence=0.95)

(-1.959963984540054, 1.959963984540054)

In [13]:
mu=82
x_bar=95
sigma=15
n=40
se=sigma/(np.sqrt(n))

z1=(x_bar-mu)/se
print('z-value',z1)

z-value 5.4812812776251905


In [14]:
p_value=stats.norm.cdf(x_bar,loc=mu,scale=se)
p_value

0.9999999788871737

In [15]:
alpha=0.05
if p_value>(1-alpha):
    print('Reject Ho or the avg calorie in bread is greater than 82')
else:
    print('Fail to reject Ho or the avg calorie in bread is <= 82')

Reject Ho or the avg calorie in bread is greater than 82


#### 3. The label of a typhoid vaccine in the market states that the vaccine contains 3 mg of ascorbic acid. A research team claims that the vaccines contain less than 3 mg of acid. We collected the data of 40 vaccines by using random sampling from a population and recorded the amount of ascorbic acid. Test the claim of the research team using the sample data, with the ⍺ value (significance level) set to 0.05.

    acid_amt = [2.57, 3.06, 3.28 , 3.24, 2.79, 3.40, 3.36, 3.07, 2.46, 3.03, 3.05, 2.94, 3.46, 3.19, 3.09, 2.81, 3.13, 2.88, 
                2.76, 2.75, 3.17, 2.89, 2.54, 3.18, 3.08, 2.60, 3.06, 3.13, 3.11, 3.08, 2.93, 2.90, 3.06, 2.97, 3.24, 2.86, 
                2.87, 3.18, 3, 2.95]

In [16]:
acid_amt = [2.57, 3.06, 3.28 , 3.24, 2.79, 3.40, 3.36, 3.07, 2.46, 3.03, 3.05, 2.94, 3.46, 
            3.19, 3.09, 2.81, 3.13, 2.88, 2.76, 2.75, 3.17, 2.89, 2.54, 3.18, 3.08, 2.60, 
            3.06, 3.13, 3.11, 3.08, 2.93, 2.90, 3.06, 2.97, 3.24, 2.86, 2.87, 3.18, 3, 2.95]

In [17]:
# Check for assumption
p_value=stats.shapiro(acid_amt)[1]
p_value

0.5609374046325684

In [18]:
if p_value<0.05:
    print('Reject Ho or data is not normal')
else:
    print('Fail to reject Ho or data is normal')

Fail to reject Ho or data is normal
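
The normality check covers only the assumption; the research team's claim still has to be tested. One possible continuation is sketched below. It assumes a left-tailed Z-test (Ho: mu >= 3, Ha: mu < 3) and uses the sample standard deviation in place of the unknown $\sigma$, as suggested earlier for the case where $\sigma$ is not given.

```python
import numpy as np
import scipy.stats as stats

acid_amt = [2.57, 3.06, 3.28, 3.24, 2.79, 3.40, 3.36, 3.07, 2.46, 3.03, 3.05, 2.94, 3.46,
            3.19, 3.09, 2.81, 3.13, 2.88, 2.76, 2.75, 3.17, 2.89, 2.54, 3.18, 3.08, 2.60,
            3.06, 3.13, 3.11, 3.08, 2.93, 2.90, 3.06, 2.97, 3.24, 2.86, 2.87, 3.18, 3, 2.95]

# Ho: mu >= 3
# Ha: mu < 3   (assumed formulation of the research team's claim)
mu = 3
alpha = 0.05
n = len(acid_amt)
x_bar = np.mean(acid_amt)
s = np.std(acid_amt, ddof=1)     # sample standard deviation (sigma is unknown)
se = s / np.sqrt(n)

z = (x_bar - mu) / se
p_value = stats.norm.cdf(z)      # left-tailed p-value
print('Sample mean:', x_bar)
print('z-value:', z)
print('p-value:', p_value)

if p_value < alpha:
    print('Reject Ho: the vaccines contain less than 3 mg of acid')
else:
    print('Fail to reject Ho: no evidence that the amount is below 3 mg')
```

The sample mean is about 3.003 mg, slightly above 3 mg, so the left-tailed p-value is larger than 0.5 and the data gives no support to the claim.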


#### 4. A sample of 900 PVC pipes is found to have an average thickness of 12.5 mm. Can we assume that the sample comes from a normal population with mean 13 mm, against the alternative that the mean is less than 13 mm? The population standard deviation is 1 mm. Test the hypothesis at 5% level of significance.

In [None]:
# Ho: 
# Ha:


In [19]:
# Ho: mu>=13
# Ha: mu<13
sigma=1
n=900
mu=13
x_bar=12.5
alpha=0.05
se=sigma/(np.sqrt(n))
p_value=stats.norm.cdf(x_bar,loc=mu,scale=se)
print('P_Value',p_value)
if p_value<alpha:
    print('Reject Ho ')
else:
    print('Fail to reject Ho ')

P_Value 3.6709661993126986e-51
Reject Ho 


#### 5. An e-commerce company claims that the mean delivery time of food items on its website in NYC is 60 minutes with a standard deviation of 30 minutes. A random sample of 45 customers ordered from the website, and the average time for delivery was found to be 75 minutes. Is this enough evidence to claim that the average time to get items delivered is more than 60 minutes? (assume the normality of the data). Test the claim with α = 0.05.

In [20]:
# Ho: mu<=60
# Ha: mu>60
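
A possible way to complete this problem, following the same pattern as the earlier right-tailed example, is sketched below. The upper-tail probability is computed with `stats.norm.sf` so that it can be compared directly with α.

```python
import numpy as np
import scipy.stats as stats

# Ho: mu <= 60
# Ha: mu > 60
mu = 60
x_bar = 75
sigma = 30
n = 45
alpha = 0.05

se = sigma / np.sqrt(n)          # standard error of the sample mean
z = (x_bar - mu) / se            # test statistic
p_value = stats.norm.sf(z)       # right-tailed p-value, P(Z >= z)
print('z-value:', z)
print('p-value:', p_value)

if p_value < alpha:
    print('Reject Ho: the avg delivery time is more than 60 minutes')
else:
    print('Fail to reject Ho: the avg delivery time is <= 60 minutes')
```

Here z is about 3.35, well beyond the right-tailed critical value of about 1.645 at α = 0.05, so Ho is rejected.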

<a id="error"></a>
# Errors in Hypothesis Testing

Under the hypothesis testing framework there can be two decisions; either we 'reject' the null hypothesis H<sub>0</sub> or 'do not reject' the null hypothesis H<sub>0</sub>. Errors can still occur in a hypothesis testing scenario because the decision to reject or not reject the null hypothesis is made on the basis of data taken from a sample. The two types of errors are `Type I` and `Type II` error.

### Type I Error

This kind of error occurs when we reject the null hypothesis even if it is true. It is equivalent to a `false positive` conclusion. The maximum probability of committing a type I error is given by the value of $\alpha$, level of significance. So when $\alpha$ = 0.05 there is a 5% chance of rejecting a true null hypothesis.
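
The meaning of $\alpha$ as a type I error rate can be checked with a small simulation. The sketch below repeatedly draws samples from a population where H<sub>0</sub> is true and counts how often a two-tailed Z-test rejects it. The population parameters, seed and number of repetitions are hypothetical choices.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=1)            # hypothetical seed
mu_0, sigma, n, alpha = 50, 10, 40, 0.05       # illustrative values; Ho: mu = 50 is true
n_sim = 10000                                  # number of simulated samples

# Each row is one sample of size n drawn from the population under Ho
samples = rng.normal(loc=mu_0, scale=sigma, size=(n_sim, n))

# Z statistic and two-tailed p-value for every simulated sample
z = (samples.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))
p_values = 2 * stats.norm.sf(np.abs(z))

# Proportion of samples in which a true Ho is rejected
type1_rate = np.mean(p_values < alpha)
print('Observed type I error rate:', type1_rate)
```

The observed proportion should land close to 0.05, although it varies slightly with the seed.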

### Type II Error

This kind of error occurs when we fail to reject the null hypothesis even though it is false. It is equivalent to a `false negative` conclusion. The probability of a type II error is denoted by $\beta$. The value of $\beta$ cannot be easily calculated because it depends on the true (unknown) value of the parameter. However, for a fixed sample size, $\alpha$ and $\beta$ are related such that decreasing one increases the other.
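
If a particular true mean is assumed, $\beta$ for a right-tailed Z-test can be computed directly. The sketch below does this for several values of $\alpha$; the helper name `type2_error` and all numbers (including the assumed true mean of 65) are hypothetical.

```python
import numpy as np
import scipy.stats as stats

def type2_error(alpha, mu_0, mu_true, sigma, n):
    """Hypothetical helper: beta for a right-tailed Z-test (Ho: mu <= mu_0) at an assumed true mean."""
    se = sigma / np.sqrt(n)
    z_crit = stats.norm.ppf(1 - alpha)         # critical value of the right-tailed test
    # Ho is not rejected when Z < z_crit; under the true mean, Z is shifted by (mu_true - mu_0)/se
    return stats.norm.cdf(z_crit - (mu_true - mu_0) / se)

# Illustrative values only: Ho mean 60, assumed true mean 65, sigma 30, n 45
betas = {}
for alpha in [0.10, 0.05, 0.01]:
    betas[alpha] = type2_error(alpha, mu_0=60, mu_true=65, sigma=30, n=45)
    print('alpha =', alpha, ' beta =', betas[alpha])
```

With the sample size held fixed, $\beta$ grows as $\alpha$ shrinks, which is the trade-off between the two types of error.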