<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [2]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import 'random' to generate random sample
import random

# import statistics to perform statistical computation  
import statistics

# import 'stats' package from scipy library
from scipy import stats

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

# to test the normality 
from scipy.stats import shapiro

# import the function to calculate the power of test
from statsmodels.stats import power

<a id="t"></a>
# 3. t Test

<a id="1t"></a>
## 3.1 One Sample t Test

Let us perform a one sample t-test for the population mean. We compare the population mean with a specific value. 

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu = \mu_{0}$ or $\mu \geq \mu_{0}$ or $\mu \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu \neq \mu_{0}$ or $\mu < \mu_{0}$ or $\mu > \mu_{0}$</strong></p>

The test statistic is given as:
<p style='text-indent:25em'> <strong> $t = \frac{\overline{X} -  \mu_{0}}{\frac{s}{\sqrt{n}}}$</strong></p>

Where, <br>
$\overline{X}$: Sample mean<br>
$s$: Sample standard deviation<br>
$n$: Sample size
 
Under $H_{0}$ the test statistic follows a t-distribution with n-1 degrees of freedom.
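The test statistic above can be traced with the standard library alone. The following is a minimal sketch; the helper name `one_sample_t` and the `scores` list are hypothetical and serve only to illustrate the arithmetic.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    t = (x_bar - mu0) / (s / math.sqrt(n))
    return t, n - 1

# hypothetical scores, used only to illustrate the arithmetic
scores = [1, 2, 3, 4, 5]
t_stat, dof = one_sample_t(scores, mu0=2)
print(t_stat, dof)
```

The p-value is then read from a t-distribution with `dof` degrees of freedom, for example with `stats.t.cdf` for a left-tailed test.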


#### 1. A survey claims that in a math test female students tend to score marks greater than 75. Consider a sample of 24 female students and perform a hypothesis test to check the claim with 90% confidence.

Use the dataset available in the CSV file `mathscore_1ttest.csv`.

In [3]:
df = pd.read_csv('mathscore_1ttest.csv')
df.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group C,standard,none,60,72,74,206,Nature Learning
1,female,group C,standard,none,59,72,68,199,Nature Learning
2,female,group E,standard,none,100,100,100,300,Speak Global Learning
3,female,group D,standard,none,69,74,74,217,Speak Global Learning
4,female,group A,free/reduced,none,47,59,50,156,Speak Global Learning


In [21]:
fem_math= df[df['gender']=='female']['math score']


In [9]:
# Test of normality - Shapiro-Wilk test
# Ho: the data are drawn from a normal distribution
# Ha: the data are not drawn from a normal distribution

In [10]:
stats.shapiro(fem_math)

ShapiroResult(statistic=0.9368310570716858, pvalue=0.13859796524047852)

In [None]:
# pval = 0.13
# sig lvl = 0.05
# pval > sig lvl, so we fail to reject Ho
# The data can be treated as normal

In [12]:
x_bar = np.mean(fem_math)
s = np.std(fem_math,ddof=1)
print(x_bar,s)

66.45833333333333 11.602020688568478


In [13]:
# Hypothesis

# Ho : mu >=75
# Ha : mu <75  (x_bar< mu)


In [None]:
# Data is normal
# pop std is not known
# one sample t test(left tailed)

In [15]:
x_bar = np.mean(fem_math)
s = np.std(fem_math,ddof=1)
mu = 75
n  = len(fem_math)
tstat = (x_bar - mu)/(s/n**0.5)
print('Test statistics:',tstat)
pval = stats.t.cdf(tstat,df=n-1)
print('pval:',pval)

Test statistics: -3.6067380757023204
pval: 0.0007426613957678669


In [20]:
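# Built-in function:

# ttest_1samp returns a two-sided p-value by default
tstat,twosid_pval = stats.ttest_1samp(fem_math,popmean=75)
# halving is valid here because tstat is negative, which matches the left-tailed Ha
onesid_pval = twosid_pval/2
print(onesid_pval)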

0.0007426613957678669


In [None]:
# pval = 0.0007
# sig lvl = 0.1
# pval < sig lvl
# Null hypothesis is rejected. Alternate is selected
# Average female score is less than 75
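Halving the two-sided p-value only works when the sign of the test statistic agrees with the alternative hypothesis. As a hedged sketch, recent SciPy releases (the keyword was added around version 1.6) accept an `alternative` argument that returns the one-sided p-value directly. The `sample` list below is hypothetical and is not the data from the CSV file.

```python
from scipy import stats

# hypothetical sample, for illustration only
sample = [60, 62, 65, 70, 58]

two_sided = stats.ttest_1samp(sample, popmean=75)
left_tailed = stats.ttest_1samp(sample, popmean=75, alternative='less')
right_tailed = stats.ttest_1samp(sample, popmean=75, alternative='greater')

# with a negative statistic, the left-tailed p-value equals half the two-sided one
print(two_sided.pvalue / 2, left_tailed.pvalue, right_tailed.pvalue)
```

The left-tailed and right-tailed p-values add up to 1, so picking the wrong tail gives a very different answer from simply halving.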

#### 2. A researcher is studying the growth of bacteria in waters of Lake Beach. The mean bacteria count of 100 per unit volume of water is within the safety level. The researcher collected 10 water samples of unit volume and found the mean bacteria count to be 94.8 with a sample variance of 72.66. Does the data indicate that the bacteria count is within the safety level? Test at the α = .05 level. Assume that the measurements constitute a sample from a normal population.

In [13]:
# Hypothesis

# Ho : mu >=100
# Ha : mu <100  (x_bar< mu)


In [None]:
# Data is normal
# pop std is not known
# one sample t test(left tailed)

In [23]:
x_bar = 94.8
s = 72.66**0.5
mu = 100
n  = 10
tstat = (x_bar - mu)/(s/n**0.5)
print('Test statistics:',tstat)
pval = stats.t.cdf(tstat,df=n-1)
print('pval:',pval)

Test statistics: -1.9291040236750068
pval: 0.04289782134327503


In [24]:
# pval = 0.04
# sig lvl = 0.05
# pval < sig lvl
# Null hypothesis is rejected. Alternate is selected
# Average bacteria count is less than 100, so it is within the safety level
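The same decision can be reached with the critical region technique instead of the p-value. This is a minimal sketch using the summary statistics from the question; the critical value `t_crit` is an assumption read from a standard t table (one-tailed, α = 0.05, 9 degrees of freedom, rounded to three decimals).

```python
import math

x_bar, var, mu0, n = 94.8, 72.66, 100, 10
t_stat = (x_bar - mu0) / (math.sqrt(var) / math.sqrt(n))

# one-tailed critical value for alpha = 0.05 and df = 9, taken from a t table (assumption)
t_crit = 1.833

# left-tailed test: reject Ho when the statistic falls below -t_crit
reject_h0 = t_stat < -t_crit
print(round(t_stat, 4), reject_h0)
```

The statistic falls in the rejection region, which agrees with the p-value approach used above.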

<a id="2t"></a>
## 3.2 Two Sample t Test (Unpaired)

The two sample t-test is used to compare the means of two independent populations. This test assumes that the populations from which the samples are taken are normally distributed.

The null and alternative hypotheses are given as:
<p style='text-indent:25em'> <strong> $H_{0}: \mu_{1} - \mu_{2} = \mu_{0}$ or $\mu_{1} - \mu_{2} \geq \mu_{0}$ or $\mu_{1} -\mu_{2} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{1} - \mu_{2} \neq \mu_{0} $ or $\mu_{1} - \mu_{2} < \mu_{0}$ or $\mu_{1} -\mu_{2} > \mu_{0}$</strong></p>

Let us take a sample of size $n_{1}$ from the first population and a sample of size $n_{2}$ from a second independent population. If both $n_{1}$ and $n_{2}$ are less than 30 and the standard deviations of the populations are unknown, we use the two-sample t-test.

Assume equal variances for both populations. The test statistic for the two sample t-test is given as:
<p style='text-indent:25em'> <strong> $t = \frac{(\overline{X_{1}} - \overline{X_{2}}) - \mu_{0}} {s \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}}$</strong></p>

Where, <br>
$\overline{X_{1}}$, $\overline{X_{2}}$: Mean of both the samples<br>
$\mu_{0}$: Mean difference given in the null hypothesis<br>
$s$: Pooled standard deviation<br>
$n_{1}, n_{2}$: Size of samples from both the populations

The pooled standard deviation is defined as:
$s = \sqrt{\frac{(n_{1} - 1)s_{1}^{2} + (n_{2} - 1)s_{2}^{2}}{n_{1} + n_{2} - 2}}$ $\hspace{2cm}$  Where, $s_{1}, s_{2}$: Standard deviation of both the samples

Under $H_{0}$, the test statistic follows a t-distribution with $(n_{1}+n_{2}-2)$ degrees of freedom.

If the population variances are equal and also the sample size is the same for both the samples then the test statistic is given as:
<p style='text-indent:25em'> <strong> $t = \frac{(\overline{X_{1}} - \overline{X_{2}}) - \mu_{0}} {s \sqrt{\frac{2}{n}}}$</strong></p>

Where the pooled standard deviation $s = \sqrt{\frac{s_{1}^{2} + s_{2}^{2}}{2}}$

Under $H_{0}$, the test statistic follows a t-distribution with $(2n-2)$ degrees of freedom.

If the population variances are not equal, Welch's t-test is used instead; it does not pool the variances and remains valid when the sample sizes differ.
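The pooled statistic and Welch's statistic can be compared side by side. The sketch below uses only the standard library; the helper names `pooled_t` and `welch_t` and the two small groups are hypothetical, and the Welch degrees of freedom use the usual Welch-Satterthwaite approximation, which is not derived in this section.

```python
import math
import statistics

def pooled_t(x1, x2, mu0=0.0):
    """Equal-variance two-sample t statistic, its degrees of freedom and pooled s."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    s = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    t = (statistics.mean(x1) - statistics.mean(x2) - mu0) / (s * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2, s

def welch_t(x1, x2, mu0=0.0):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    a, b = statistics.variance(x1) / n1, statistics.variance(x2) / n2
    t = (statistics.mean(x1) - statistics.mean(x2) - mu0) / math.sqrt(a + b)
    dof = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, dof

# hypothetical groups with clearly different spreads
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_pooled, dof_pooled, s_pooled = pooled_t(group_a, group_b)
t_welch, dof_welch = welch_t(group_a, group_b)
print(t_pooled, dof_pooled, t_welch, dof_welch)
```

With equal sample sizes the two statistics coincide, but Welch's test uses fewer degrees of freedom when the variances differ. In SciPy the corresponding switch is `stats.ttest_ind(..., equal_var=False)`.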

### Example: 

#### 1. The teachers' association claims that the total score of the students who completed the test preparation course is different than the total score of the students who have not completed the course. The sample data consists of 15 students who completed the course and 18 students who have not completed the course. Test the association's claim with ⍺ = 0.05.

The total scores of the students who have/have not completed the preparation course are given in the CSV file `totalmarks_2ttest.csv`.

In [25]:
df = pd.read_csv('totalmarks_2ttest.csv')
df.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,male,group E,standard,completed,84,83,78,245,Speak Global Learning
1,male,group C,free/reduced,completed,79,77,75,231,Speak Global Learning
2,male,group A,standard,none,91,96,92,279,Nature Learning
3,female,group B,free/reduced,completed,76,94,87,257,Speak Global Learning
4,male,group A,standard,completed,46,41,43,130,Nature Learning


In [34]:
completed = df[df['test preparation course']=='completed']['total score']
notcompleted =df[df['test preparation course']=='none']['total score']

In [35]:
# Test of normality - Shapiro-Wilk test
# Ho: the data are drawn from a normal distribution
# Ha: the data are not drawn from a normal distribution

In [36]:
stats.shapiro(completed)

ShapiroResult(statistic=0.9055534601211548, pvalue=0.11574020981788635)

In [37]:
# pval = 0.11
# sig lvl = 0.05
# pval > sig lvl, so we fail to reject Ho
# The data can be treated as normal

In [38]:
stats.shapiro(notcompleted)

ShapiroResult(statistic=0.948186457157135, pvalue=0.39728137850761414)

In [None]:
# pval = 0.39
# sig lvl = 0.05
# pval > sig lvl, so we fail to reject Ho
# The data can be treated as normal

In [39]:
x1_bar  = np.mean(completed)
x2_bar = np.mean(notcompleted)
print(x1_bar,x2_bar)

213.86666666666667 196.5


In [42]:
# Hypothesis:

# Ho : mu of completed = mu of not completed
# Ha : mu of completed != mu of not completed

In [None]:
# Data is normal
# pop std is not known
# samples are independent
# Unpaired two sample t test (two tailed)

In [41]:
ttstat,twosid_pval = stats.ttest_ind(completed,notcompleted)
print('T stat:',ttstat)
print('Two sided Pval:',twosid_pval)

T stat: 1.4385323319823262
Two sided Pval: 0.16030339806989594


In [None]:
# pval = 0.16
# sig lvl = 0.05
# pval > sig lvl
# We fail to reject the null hypothesis
# There is no significant difference in the performance of completed and not completed.

## Practice:

1. The teachers' association claims that the total score of students at Speak Global Learning is greater than the total score of students at Nature Learning. Test the association's claim with ⍺ = 0.05.

In [44]:
sgl = df[df['training institute']=='Speak Global Learning']['total score']
nl =  df[df['training institute']=='Nature Learning']['total score']

In [35]:
# Test of normality - Shapiro-Wilk test
# Ho: the data are drawn from a normal distribution
# Ha: the data are not drawn from a normal distribution

In [45]:
stats.shapiro(sgl)

ShapiroResult(statistic=0.940517246723175, pvalue=0.26940712332725525)

In [37]:
# pval = 0.26
# sig lvl = 0.05
# pval > sig lvl, so we fail to reject Ho
# The data can be treated as normal

In [46]:
stats.shapiro(nl)

ShapiroResult(statistic=0.960299015045166, pvalue=0.7280198335647583)

In [47]:
# pval = 0.72
# sig lvl = 0.05
# pval > sig lvl, so we fail to reject Ho
# The data can be treated as normal

In [48]:
x1_bar  = np.mean(sgl)
x2_bar = np.mean(nl)
print(x1_bar,x2_bar)

209.6315789473684 197.28571428571428


In [49]:
# Hypothesis:

# Ho : mu of sgl <= mu of nl
# Ha : mu of sgl > mu of nl

In [None]:
# Data is normal
# pop std is not known
# samples are independent
# Unpaired two sample t test (right tailed)

In [52]:
ttstat,twosid_pval = stats.ttest_ind(sgl,nl)
print('T stat:',ttstat)
print('Two sided Pval:',twosid_pval)
print('One sided pval:',twosid_pval/2)

T stat: 0.9984458677537893
Two sided Pval: 0.32579344760218754
One sided pval: 0.16289672380109377


In [None]:
# pval = 0.162
# sig lvl = 0.05
# pval > sig lvl
# We fail to reject the null hypothesis
# There is not enough evidence that the score of sgl is greater than nl.

<a id="paired"></a>
## 3.3 Paired t Test

A paired t-test is used to compare the mean of the population for two dependent samples. The dependent samples can be the scores before and after a specific treatment. 

Let $X_{i}$ be the sample before the treatment and $Y_{i}$ be the sample after the treatment. Let $\mu_{X}$, $\mu_{Y}$ be the mean of the data X and Y respectively. The mean difference $\mu_{d} = \mu_{Y} - \mu_{X}$.

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu_{d} = \mu_{0}$ or $\mu_{d} \geq \mu_{0}$ or $\mu_{d} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{d} \neq \mu_{0}$ or $\mu_{d} < \mu_{0}$ or $\mu_{d} > \mu_{0}$</strong></p>

The test statistic for paired t-test is given as:
<p style='text-indent:25em'> <strong> $t = \frac{\overline{X_{D}} - \mu_{0}} {\frac{s_{D}}{\sqrt{n}}}$</strong></p>

Where, <br>
$\overline{X_{D}}$: Mean difference between the pairs<br>
$\mu_{0}$: Mean difference given in the null hypothesis<br>
$s_{D}$: Standard deviation of differences between the pairs<br>
$n$: Sample size

Under $H_{0}$, the test statistic follows a t-distribution with (n-1) degrees of freedom.
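The paired statistic is simply a one-sample t statistic computed on the differences. The following is a minimal standard-library sketch; the helper name `paired_t` and the `before`/`after` values are hypothetical, and the differences are taken as after minus before to match $\mu_{d} = \mu_{Y} - \mu_{X}$ above.

```python
import math
import statistics

def paired_t(x, y, mu0=0.0):
    """Paired t statistic on the differences d = y - x, with its degrees of freedom."""
    d = [yi - xi for xi, yi in zip(x, y)]
    n = len(d)
    t = (statistics.mean(d) - mu0) / (statistics.stdev(d) / math.sqrt(n))
    return t, n - 1

# hypothetical scores before and after a treatment
before = [10, 12, 9, 11]
after = [12, 15, 9, 14]
t_stat, dof = paired_t(before, after)
print(t_stat, dof)
```

Reversing the order of subtraction flips the sign of the statistic, so the tail of the test has to be chosen to match the direction in which the differences were taken.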

### Example:

#### 1. A training institute wants to check if their writing training program was effective or not. 17 students are selected to check the hypothesis. Consider 0.05 as the level of significance.

The writing scores before and after training are provided in the CSV file `WritingScores.csv`. 

In [54]:
df = pd.read_csv('WritingScores.csv')
df.head()

Unnamed: 0,score_before,score_after
0,59,50
1,62,67
2,76,92
3,32,75
4,61,98


In [55]:
bef_scr = df['score_before']
aft_scr = df['score_after']

In [35]:
# Test of normality - Shapiro-Wilk test
# Ho: the data are drawn from a normal distribution
# Ha: the data are not drawn from a normal distribution

In [56]:
stats.shapiro(bef_scr)

ShapiroResult(statistic=0.9473825097084045, pvalue=0.416460782289505)

In [37]:
# pval = 0.41
# sig lvl = 0.05
# pval > sig lvl, so we fail to reject Ho
# The data can be treated as normal

In [57]:
stats.shapiro(aft_scr)

ShapiroResult(statistic=0.9686523675918579, pvalue=0.7944130897521973)

In [58]:
# pval = 0.79
# sig lvl = 0.05
# pval > sig lvl, so we fail to reject Ho
# The data can be treated as normal

In [59]:
x1_bar  = np.mean(bef_scr)
x2_bar = np.mean(aft_scr)
print(x1_bar,x2_bar)

67.05882352941177 73.41176470588235


In [49]:
# Hypothesis:

# Ho : mu of before  >=  mu of after
# Ha : mu of before  <  mu of after

In [None]:
# Data is normal
# pop std is not known
# samples are dependent (paired)
# paired two sample t test (left tailed)

In [60]:
ttstat,twosid_pval = stats.ttest_rel(bef_scr,aft_scr)
print('T stat:',ttstat)
print('Two sided Pval:',twosid_pval)
print('One sided pval:',twosid_pval/2)

T stat: -1.4394882729049499
Two sided Pval: 0.16929012896279846
One sided pval: 0.08464506448139923


In [64]:
# Alternative: one sample t test on the paired differences
diff = bef_scr-aft_scr
ttstat,twosid_pval = stats.ttest_1samp(diff,popmean=0)
print('T stat:',ttstat)
print('Two sided Pval:',twosid_pval)
print('One sided pval:',twosid_pval/2)

T stat: -1.4394882729049499
Two sided Pval: 0.16929012896279846
One sided pval: 0.08464506448139923


In [61]:
# pval = 0.08
# sig lvl = 0.05
# pval > sig lvl
# We fail to reject the null hypothesis
# There is not enough evidence that the score after training is greater than before.

#### 2. An energy drink distributor claims that a new advertisement poster, featuring a life-size picture of a well-known athlete, will increase the product sales in outlets by an average of 50 bottles in a week. For a random sample of 10 outlets, the following data was collected. Test the null hypothesis that the advertisement was effective in increasing sales. Test the hypothesis using the critical region technique. Use α = 0.05.

Given data:

        sales_before = [33, 32, 38, 45, 37, 47, 48, 41, 45]
        sales_after = [42, 35, 31, 41, 37, 36, 49, 49, 48]

<a id="prop"></a>
# 4. Z Proportion Test

<a id="1_p"></a>
## 4.1 One Sample Test

Perform one sample Z test for the population proportion. We compare the population proportion ($P$) with a specific value ($P_{0}$).

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: P = P_{0}$ or $P \geq P_{0}$ or $P \leq P_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: P \neq P_{0}$ or $P < P_{0}$ or $P > P_{0}$</strong></p>

The test statistic for proportion Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{p -  P_{0}}{\sqrt{\frac{P_{0}(1-P_{0})}{n}}}$</strong></p>

Where, <br>
$p$: Sample proportion<br>
$n$: Sample size

Under $H_{0}$, the test statistic follows a standard normal distribution.

### Example:

#### 1. In previous years, people believed that at most 80% of male students score more than 50 marks out of 100 in Mathematics. Perform a test to check whether this percentage is more than 80. Consider the level of significance as 0.05.

Consider the sample of math scores of male students available in the CSV file `StudentsPerformance.csv`.

In [69]:
df =pd.read_csv('StudentsPerformance.csv')
df.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group B,standard,none,89,55,56,200,Nature Learning
1,female,group C,standard,completed,55,63,72,190,Nature Learning
2,female,group B,standard,none,64,71,56,191,Nature Learning
3,male,group A,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,standard,none,75,66,51,192,Nature Learning


In [77]:
male_gr_50 = df[(df['gender']=='male')&(df['math score']>50)]
male =df[df['gender']=='male']
p_sam = len(male_gr_50)/len(male)
print(p_sam)

0.8757763975155279


In [76]:
len(male),len(male_gr_50)

(483, 423)

In [78]:
# Ho : p <= 0.8
# Ha : p>0.8

In [82]:
p_sam = len(male_gr_50)/len(male)
p_pop = 0.8
n = len(male)
num = p_sam-p_pop
# note: the standard error here uses the sample proportion, not P0 as in the formula above
den = np.sqrt((p_sam*(1-p_sam))/n)
zstat = num/den
print('Test stat:',zstat)
pval = stats.norm.sf(zstat)
print('pval:',pval)

Test stat: 5.049040355417874
pval: 2.2201746989024182e-07


In [None]:
# pval = 2.2e-07 (approximately 0)
# sig lvl = 0.05
# pval < sig.lvl
# Null hypothesis is rejected. Ha is selected
# proportion is greater than 0.8
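The formula in this section places $P_{0}$ in the standard error, whereas the cell above uses the sample proportion. The following standard-library sketch computes the statistic as the formula states it, using the counts printed earlier (423 of 483 male students); `NormalDist` stands in for `stats.norm` here.

```python
import math
from statistics import NormalDist

count, n, p0 = 423, 483, 0.8  # counts printed earlier in this example
p_hat = count / n

# standard error under Ho uses the hypothesized proportion P0
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# right-tailed p-value, since Ha is P > 0.8
p_value = 1 - NormalDist().cdf(z)
print(z, p_value)
```

The statistic comes out smaller than the one printed above, but the p-value is still far below 0.05, so the conclusion is unchanged.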

#### A claim states that 70% of people opt for an LED TV. A sample of 100 people is chosen and it is found that 65% of them opt for an LED TV. Test the hypothesis at the 95% confidence level.

#### 2. A sample of 361 business owners had gone into bankruptcy due to the recession. On taking a survey, it was found that 105 of them had not consulted any professional for managing their finances before opening the business. Test the null hypothesis that at most 25% of all businesses had not consulted before opening the business. Test the claim using the p-value technique. Use α = 0.05.

<a id="2_p"></a>
## 4.2 Two Sample Test

Perform two sample Z test for the population proportion. We check the equality of population proportions $P_{1}$ and $P_{2}$.

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: P_{1} - P_{2} = P_{0}$ or $P_{1} - P_{2} \geq P_{0}$ or $P_{1} - P_{2} \leq P_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: P_{1} - P_{2} \neq P_{0}$ or $P_{1} - P_{2} < P_{0}$ or $P_{1} - P_{2} > P_{0}$</strong></p>

The test statistic for two sample proportion Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{(p_{1} -  p_{2}) - P_{0}}{\sqrt{\bar{P}(1-\bar{P})(\frac{1}{n_{1}} + \frac{1}{n_{2}})}}$   $\hspace{2 cm} \bar{P} = \frac{n_{1}p_{1} + n_{2}p_{2}}{n_{1} + n_{2}}$ </strong></p>

Where, <br>
$p_{1}, p_{2}$: Sample proportions<br>
$P_{0}$: Hypothesized difference between the proportions<br>
$\bar{P}$: Proportion of pooled sample<br>
$n_{1}, n_{2}$: Sample sizes

### Example:

#### 1. A team of nutritionists believes that each institute provides 'standard' lunch to an equal proportion of students. A sample of students from institutes <i>Nature Learning</i> and <i>Speak Global Learning</i> is given. Consider the null hypothesis as equality of proportion with 0.1 level of significance.

Consider the sample data available in the CSV file `StudentsPerformance.csv`.

In [83]:
# read the students performance data 
df = pd.read_csv('StudentsPerformance.csv')

# display the first five observations
df.head(5)

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group B,standard,none,89,55,56,200,Nature Learning
1,female,group C,standard,completed,55,63,72,190,Nature Learning
2,female,group B,standard,none,64,71,56,191,Nature Learning
3,male,group A,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,standard,none,75,66,51,192,Nature Learning


In [96]:
nl = df[df['training institute']=='Nature Learning']
nl_std = nl[nl['lunch']=='standard']
nl_count = len(nl)
nl_std_count = len(nl_std)
p_nl = nl_std_count/nl_count

In [97]:
sgl = df[df['training institute']=='Speak Global Learning']
sgl_std = sgl[sgl['lunch']=='standard']
sgl_count = len(sgl)
sgl_std_count = len(sgl_std)
p_sgl = sgl_std_count/sgl_count

In [98]:
# Ho : p1=p2
# Ha : p1!=p2

In [99]:
from statsmodels.stats import proportion

In [102]:
zstat,pvalue=proportion.proportions_ztest(count=[nl_std_count,sgl_std_count]
                             ,nobs=[nl_count,sgl_count])

# Count - favorable count
# nobs - total count

print('Zstat:',zstat)
print('Pval:',pvalue)

Zstat: 0.7935300106078008
Pval: 0.4274690915859791


In [103]:
# pval=0.42
# sig lvl =0.1
# pval > sig lvl
# We fail to reject Ho.
# There is no significant difference between the proportions.

#### 2. Steve owns a kiosk where he sells two magazines - A and B in a month. He buys 100 copies of magazine A out of which 78 were sold and 70 copies of magazine B out of which 65 were sold. Is there enough evidence to say that magazine B is more popular? Test the claim using the p-value technique with α = 0.05.

In [104]:
a_count = 100
a_sold_count =78
b_count = 70
b_sold_count =65
a_prop = a_sold_count/a_count
b_prop = b_sold_count/b_count
print(a_prop,b_prop)

0.78 0.9285714285714286


In [105]:
# Ho :  p1>=p2
# Ha  : p1<p2

In [107]:
zstat,twosid_pvalue=proportion.proportions_ztest(count=[a_sold_count,b_sold_count]
                             ,nobs=[a_count,b_count])

# Count - favorable count
# nobs - total count

print('Zstat:',zstat)
print('Pval:',twosid_pvalue/2)



Zstat: -2.60830803458311
Pval: 0.004549551600547303


In [None]:
# pval=0.004
# sig lvl =0.05
# pval < sig lvl
# Ho is rejected.
# proportion sold of magazine A is less than that of magazine B.
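The pooled two-sample statistic from the formula above can be reproduced by hand for the magazine data. This is a minimal standard-library sketch; the helper name `two_proportion_z` is hypothetical, and `NormalDist` stands in for the normal distribution functions of SciPy.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2, diff0=0.0):
    """Pooled two-sample Z statistic for the difference of proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return ((p1 - p2) - diff0) / se

# magazine A: 78 sold of 100, magazine B: 65 sold of 70
z = two_proportion_z(78, 100, 65, 70)

# left-tailed p-value, since Ha is p1 < p2
p_left = NormalDist().cdf(z)
print(z, p_left)
```

Both values agree with the output of `proportions_ztest` shown above, which confirms that halving the two-sided p-value was appropriate for this left-tailed test.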

<a id="non_para"></a>
# 5. Non-parametric Tests

Parametric tests are tests that assume the population from which the sample is taken follows a known distribution (for example, the normal distribution). Non-parametric tests can be used when the assumptions of parametric tests are not satisfied.

`Non-parametric tests` do not require any assumptions about the distribution of the population from which the sample is taken. These tests can be applied to the ordinal/ nominal data. A non-parametric test can be performed on the data containing outliers. The observations in the sample are assumed to be independent for a non-parametric test.

<a id="1samp"></a>
## 5.1 Wilcoxon Signed Rank Test

### 1. One-sample Test

Wilcoxon signed rank test is used to compare the median (M) of a sample to a specific value ($M_{0}$). This test is a non-parametric alternative to the one-sample t-test which is used to compare the mean of population with a specific value.

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: M = M_{0}$ or $M \geq M_{0}$ or $M \leq M_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: M \neq M_{0}$ or $M < M_{0}$ or $M > M_{0}$</strong></p>

To perform the test, arrange the sample in ascending order and calculate the difference between each sample point and $M_{0}$. Rank the absolute values of the differences using integers starting from 1, giving the average of the ranks to tied differences. The test statistic is then computed from the sums of the ranks of the positive and the negative differences.

### Example:

#### 1. The Sweet Life company, which produces hand sanitizers, states that its sanitizer contains an average alcohol volume of 0.82. Perform hypothesis testing with α = 0.1.

Given data:

        alc_per = [0.32, 0.43, 0.38, 0.35, 0.85, 0.79]

In [108]:
alc_per = [0.32, 0.43, 0.38, 0.35, 0.85, 0.79]

In [35]:
# Test of normality - Shapiro-Wilk test
# Ho: the data are drawn from a normal distribution
# Ha: the data are not drawn from a normal distribution

In [109]:
stats.shapiro(alc_per)

ShapiroResult(statistic=0.7890822291374207, pvalue=0.04677129536867142)

In [37]:
# pval = 0.04
# sig lvl = 0.05
# pval < sig lvl, so Ho is rejected
# The data is not normal

In [110]:
# Ho : median of alcohol percent = 0.82
# Ha :  median of alcohol percent != 0.82

In [111]:
# Data is not normal
# One sample wilcoxon(Two tailed)

In [115]:
m0=0.82
diff = np.array(alc_per)-m0

In [117]:
tstat,pval = stats.wilcoxon(diff)
print('Test statistic:',tstat)
print('Pval:',pval)

Test statistic: 2.0
Pval: 0.09375


In [None]:
# pval = 0.09
# sig lvl =0.1
# pval < sig lvl
# Ho is rejected. Ha is selected
# Median of alcohol is not 0.82

### 2. Two-sample Paired Test

The Wilcoxon signed rank test can be used to compare the medians of paired data. This test is a non-parametric alternative to the paired t-test. Let us consider variables X and Y. The median of the difference between the two paired samples is denoted by $M_{d}$, where the difference between the two samples is given as $d_{i} = x_{i} - y_{i}$.



### Example:

#### 1. The weights (in kg) of five hens were recorded before and after a special diet of millets was given. Test the hypothesis that the new millet diet has increased the weight of the hens at a 5% level of significance.

        before_wt = [2.7, 1.1, 1.4, 0.9, 0.9] 
        after_wt = [1.3, 1.4, 1.1, 1.3, 1.9] 

In [118]:
before_wt = [2.7, 1.1, 1.4, 0.9, 0.9] 
after_wt = [1.3, 1.4, 1.1, 1.3, 1.9] 

In [35]:
# Test of normality - Shapiro-Wilk test
# Ho: the data are drawn from a normal distribution
# Ha: the data are not drawn from a normal distribution

In [119]:
stats.shapiro(before_wt)

ShapiroResult(statistic=0.7607763409614563, pvalue=0.037337224930524826)

In [37]:
# pval = 0.03
# sig lvl = 0.05
# pval < sig lvl, so Ho is rejected
# The data is not normal

In [120]:
stats.shapiro(after_wt)

ShapiroResult(statistic=0.8582397699356079, pvalue=0.22199729084968567)

In [37]:
# pval = 0.22
# sig lvl = 0.05
# pval > sig lvl, so we fail to reject Ho
# The data can be treated as normal

In [None]:
# At least one sample is not normal, so a non-parametric test is used.

In [121]:
np.median(before_wt),np.median(after_wt)

(1.1, 1.3)

In [None]:
# Ho : median of before weight >= median of after weight
# Ha : median of before weight < median of after weight

In [123]:
tstat,twosid_pval = stats.wilcoxon(before_wt,after_wt)
print('Test statistic:',tstat)
print('Pval:',twosid_pval/2)

Test statistic: 6.5
Pval: 0.40625


In [None]:
# pval = 0.40
# sig lvl =0.05
# pval > sig lvl
# Ho is not rejected. 
# There is not enough evidence that the diet increased the weight of the hens,
# so the diet cannot be called effective.
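The ranking procedure behind the signed rank statistic can be traced by hand for the hen weights. The sketch below uses only built-in Python; the helper `avg_rank` is hypothetical, and the differences are rounded so that floating-point noise does not hide ties.

```python
before_wt = [2.7, 1.1, 1.4, 0.9, 0.9]
after_wt = [1.3, 1.4, 1.1, 1.3, 1.9]

# paired differences d = x - y; rounding removes floating-point noise so ties are detected
diffs = [round(b - a, 10) for b, a in zip(before_wt, after_wt)]
diffs = [d for d in diffs if d != 0]  # zero differences are dropped

# rank the absolute differences, giving tied values the average of their ranks
abs_sorted = sorted(abs(d) for d in diffs)

def avg_rank(value):
    first = abs_sorted.index(value) + 1
    last = first + abs_sorted.count(value) - 1
    return (first + last) / 2

w_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)
w_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)
statistic = min(w_plus, w_minus)
print(w_plus, w_minus, statistic)
```

The smaller of the two rank sums matches the test statistic printed by `stats.wilcoxon` above.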

<a id="m_w"></a>
## 5.2 Mann-Whitney U Test

It is a non-parametric test that compares the distributions of independent populations. This test can be used as a non-parametric alternative for the unpaired t-test. 
Consider a sample of size $n_{1}$ from a random variable X and another sample of size $n_{2}$ from a random variable Y.



### Examples:

#### 1. Two companies, EyeCare and VisionFirst, produce timolol eye drops. A sample of 5 bottles from each company is selected and the content of timolol maleate in milligrams is recorded. Perform the Mann-Whitney U test to test whether the amount of timolol maleate is different for the two companies. Use a level of significance of 0.05.

Given data:

        eyecare = [6.18, 6.45, 6.21, 8.68, 8.45]
        visionfirst = [5.8, 7.8, 6.2, 5.9, 6.2]

In [124]:
eyecare = [6.18, 6.45, 6.21, 8.68, 8.45]
visionfirst = [5.8, 7.8, 6.2, 5.9, 6.2]

In [35]:
# Test of normality - Shapiro-Wilk test
# Ho: the data are drawn from a normal distribution
# Ha: the data are not drawn from a normal distribution

In [125]:
stats.shapiro(eyecare)

ShapiroResult(statistic=0.7654824256896973, pvalue=0.04114510864019394)

In [37]:
# pval = 0.04
# sig lvl = 0.05
# pval < sig lvl, so Ho is rejected
# The data is not normal

In [126]:
stats.shapiro(visionfirst)

ShapiroResult(statistic=0.7419655323028564, pvalue=0.025034615769982338)

In [127]:
# pval = 0.025
# sig lvl = 0.05
# pval < sig lvl, so Ho is rejected
# The data is not normal

In [128]:
# Neither sample is normal, so a non-parametric test is used.

In [129]:
np.median(eyecare),np.median(visionfirst)

(6.45, 6.2)

In [None]:
# Ho : median of eyecare = median of visionfirst
# Ha : median of eyecare != median of visionfirst

In [131]:
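# note: in older SciPy releases the default (alternative=None) returns a one-sided p-value;
# pass alternative='two-sided' to get the p-value for a two-tailed test
tstat,twosid_pval = stats.mannwhitneyu(eyecare,visionfirst)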
print('Test statistic:',tstat)
print('Pval:',twosid_pval)

Test statistic: 5.0
Pval: 0.07122834869704933


In [None]:
# pval = 0.07
# sig lvl =0.05
# pval > sig lvl
# Ho is not rejected. 
# There is not enough evidence that the median timolol content of EyeCare differs
# from that of VisionFirst.
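The U statistic itself is a simple count over all pairs of observations. The following is a minimal sketch in plain Python; the helper name `u_statistic` is hypothetical. Which of the two U values a library reports depends on its version, so both are printed.

```python
eyecare = [6.18, 6.45, 6.21, 8.68, 8.45]
visionfirst = [5.8, 7.8, 6.2, 5.9, 6.2]

def u_statistic(x, y):
    """Count pairs (xi, yj) with xi > yj; a tie counts as one half."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0 for xi in x for yj in y)

u_eyecare = u_statistic(eyecare, visionfirst)
u_visionfirst = u_statistic(visionfirst, eyecare)

# the two U values always add up to n1 * n2
print(u_eyecare, u_visionfirst, min(u_eyecare, u_visionfirst))
```

The smaller of the two values matches the test statistic printed above.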