# Pearson’s Correlation Coefficient

Tests whether two samples have a linear relationship.

Assumptions

Observations in each sample are independent and identically distributed (iid).
Observations in each sample are normally distributed.
Observations in each sample have the same variance.

Interpretation

H0: the two samples are independent.
H1: there is a dependency between the samples.

In [1]:

from scipy.stats import pearsonr
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869] 
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579] 
stat, p = pearsonr(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
	print('Probably independent')
else:
	print('Probably dependent')

stat=0.688, p=0.028
Probably dependent
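
The statistic returned by `pearsonr` can be reproduced by hand, which makes it clearer what is being measured. The following is a minimal pure-Python sketch, assuming the usual sample formula (sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The helper name `pearson_r` is hypothetical and is not part of SciPy.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Same data as in the cell above
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]
r = pearson_r(data1, data2)
print('r=%.3f' % r)
```

The printed value should agree with the `stat` reported above. The p-value is not reproduced here because it needs the t distribution, which the standard library does not provide.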


# Chi-Squared Test

Tests whether two categorical variables are related or independent.


Assumptions

Observations used in the calculation of the contingency table are independent.


Interpretation

H0: the two samples are independent.
H1: there is a dependency between the samples.

In [4]:
from scipy.stats import chi2_contingency
# Contingency table of observed counts (rows and columns are the categories)
table = [[10, 20, 30], [6, 9, 17], [5, 10, 20]]
stat, p, dof, expected = chi2_contingency(table)
print('stat=%f, p=%f' % (stat, p))
if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')

stat=0.673769, p=0.954524
Probably independent
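
To see where the chi-squared statistic comes from, the expected counts and the statistic can be computed directly from the table. This is a minimal sketch, assuming the standard formula (expected count = row total × column total / grand total, statistic = sum of (observed − expected)² / expected). The helper name `chi2_statistic` is hypothetical, and no continuity correction is applied.

```python
def chi2_statistic(table):
    """Return (statistic, dof, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # Degrees of freedom: (rows - 1) * (columns - 1)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, expected

# Same 3x3 table as in the cell above
table = [[10, 20, 30], [6, 9, 17], [5, 10, 20]]
stat, dof, expected = chi2_statistic(table)
print('stat=%f, dof=%d' % (stat, dof))
```

For this table the hand-computed statistic should match the `stat` printed above.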


In [2]:
table

[[10, 20, 30], [6, 9, 17], [5, 10, 20]]

# Student’s t-test

Tests whether the means of two independent samples are significantly different.

Assumptions

Observations in each sample are independent and identically distributed (iid).
Observations in each sample are normally distributed.
Observations in each sample have the same variance.

Interpretation

H0: the means of the samples are equal.
H1: the means of the samples are unequal.

In [4]:

from scipy.stats import ttest_ind
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = ttest_ind(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same mean')
else:
    print('Probably different means')

stat=-0.326, p=0.748
Probably the same mean
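
The equal-variance assumption matters because the test pools the two sample variances into one estimate. The sketch below recomputes the t statistic by hand under that assumption. The helper name `pooled_t` is hypothetical; only the standard library is used.

```python
import math
from statistics import mean, variance

def pooled_t(x, y):
    """Two-sample t statistic with a pooled (equal-variance) estimate."""
    n1, n2 = len(x), len(y)
    # statistics.variance uses the n - 1 (sample) denominator
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(x) - mean(y)) / std_err

# Same data as in the cell above
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
t = pooled_t(data1, data2)
print('t=%.3f' % t)
```

The value should agree with the `stat` reported by `ttest_ind` above.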


# Paired Student’s t-test

Tests whether the means of two paired samples are significantly different.

Assumptions

Pairs of observations are independent and identically distributed (iid).
Differences between the paired observations are approximately normally distributed.
Observations across each sample are paired.

Interpretation

H0: the means of the samples are equal.
H1: the means of the samples are unequal.

In [5]:
from scipy.stats import ttest_rel
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
stat, p = ttest_rel(data1, data2)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably the same mean')
else:
    print('Probably different means')

stat=-0.334, p=0.746
Probably the same mean
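
A paired t-test is equivalent to a one-sample t-test on the pairwise differences, tested against zero. The sketch below illustrates that equivalence with the standard library only. The helper name `paired_t` is hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired t statistic: one-sample t on the differences x[i] - y[i] against 0."""
    diffs = [a - b for a, b in zip(x, y)]
    # Standard error of the mean difference (stdev uses the n - 1 denominator)
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / std_err

# Same data as in the cell above
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
t = paired_t(data1, data2)
print('t=%.3f' % t)
```

The value should agree with the `stat` reported by `ttest_rel` above. Swapping the two arguments only flips the sign of the statistic.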


# Analysis of Variance Test (ANOVA)

Tests whether the means of two or more independent samples are significantly different.

Assumptions

Observations in each sample are independent and identically distributed (iid).
Observations in each sample are normally distributed.
Observations in each sample have the same variance.

Interpretation

H0: the means of the samples are equal.
H1: one or more of the means of the samples are unequal.

In [6]:

from scipy.stats import f_oneway
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]
stat, p = f_oneway(data1, data2, data3)
print('stat=%.3f, p=%.3f' % (stat, p))  
if p > 0.05:
    print('Probably the same mean')
else:
    print('Probably different means')

stat=0.096, p=0.908
Probably the same mean
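
The F statistic compares the variation between group means with the variation within groups. The following minimal sketch computes it by hand for a one-way layout. The helper name `f_statistic` is hypothetical; only the standard library is used.

```python
from statistics import mean

def f_statistic(*groups):
    """One-way ANOVA F statistic: between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(sum((v - mean(g)) ** 2 for v in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Same data as in the cell above
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]
f = f_statistic(data1, data2, data3)
print('F=%.3f' % f)
```

A value well below 1, as here, means the group means differ less than the within-group noise would suggest, which is consistent with the large p-value above.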


# One-Sample t-test
 
Example

A ball has a nominal diameter of 5 cm, and we want to check whether the average diameter of a random sample (e.g. 50 balls) picked from the production line differs from that known size.
    
    
Assumptions
The dependent variable should be approximately normally distributed (this can be checked with the Shapiro-Wilk test).
Observations are independent of each other.


Hypotheses
Null hypothesis: Sample mean is equal to the hypothesized or known population mean
Alternative hypothesis: Sample mean is not equal to the hypothesized or known population mean (two-tailed or two-sided)
Alternative hypothesis: Sample mean is either greater than or less than the hypothesized or known population mean (one-tailed or one-sided)

In [5]:
from bioinfokit.analys import get_data, stat 

In [7]:
df = get_data('t_one_samp').data  
df.head()                       

Unnamed: 0,size
0,5.739987
1,5.254042
2,5.152388
3,4.870819
4,3.536251


In [8]:
res = stat()
res.ttest(df=df, test_type=1, res='size', mu=5) 
print(res.summary)


One Sample t-test 

------------------  --------
Sample size         50
Mean                 5.05128
t                    0.36789
Df                  49
P-value (one-tail)   0.35727
P-value (two-tail)   0.71454
Lower 95.0%          4.77116
Upper 95.0%          5.3314
------------------  --------


Interpretation

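The p-value obtained from the one-sample t-test is not significant (p > 0.05), so we fail to reject the null hypothesis: the sample gives no evidence that the average diameter of the balls differs from 5 cm. This is not proof that the mean is exactly 5 cm, only that the data are consistent with it.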
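
The confidence interval and the two-sided p-value in the summary tell the same story: the hypothesized mean lies inside the 95% interval exactly when the two-sided test is not significant at the 5% level. The short sketch below checks this using only the numbers printed in the summary above; the variable names are illustrative.

```python
# Values copied from the One Sample t-test summary above
mu = 5.0
ci_lower, ci_upper = 4.77116, 5.3314
p_two_tail = 0.71454
p_one_tail = 0.35727
alpha = 0.05

# mu inside the 95% interval corresponds to a non-significant two-sided test at 5%
mu_in_interval = ci_lower <= mu <= ci_upper
significant = p_two_tail < alpha
print('mu inside 95%% CI: %s, significant: %s' % (mu_in_interval, significant))

# The t distribution is symmetric, so the one-tailed p-value (in the direction
# of the observed difference) is half of the two-tailed p-value
print('two-tail / 2 = %.5f' % (p_two_tail / 2))
```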

# Two-Sample t-test

The two-sample independent t-test is used to compare the means of two independent groups.

Example

We have two different plant genotypes (genotype A and genotype B) and would like to test whether the yield of genotype A is significantly different from that of genotype B.
    
Hypotheses
Null hypothesis: Two group means are equal
Alternative hypothesis: Two group means are different (two-tailed or two-sided)
Alternative hypothesis: Mean of one group is either greater or less than that of the other group (one-tailed or one-sided)

Two-sample t-test Assumptions
Observations in two groups have an approximately normal distribution 
Homogeneity of variances (variances are equal between treatment groups) 
The two groups are sampled independently from each other from the same population


In [6]:
from bioinfokit.analys import get_data, stat    
df = get_data('t_ind_samp').data    
df  

Unnamed: 0,Genotype,yield
0,A,78.0
1,A,84.3
2,A,81.0
3,B,88.0
4,B,92.0
5,B,84.1
6,A,74.5
7,A,77.8
8,A,79.0
9,B,88.0


In [16]:
res = stat()
res.ttest(df=df, xfac="Genotype", res="yield", test_type=2)
print(res.summary)


Two sample t-test with equal variance

------------------  -------------
Mean diff           -10.3
t                    -5.40709
Std Error             1.90491
df                   10
P-value (one-tail)    0.000149204
P-value (two-tail)    0.000298408
Lower 95.0%         -14.5444
Upper 95.0%          -6.05561
------------------  -------------

Parameter estimates

Level      Number    Mean    Std Dev    Std Error    Lower 95.0%    Upper 95.0%
-------  --------  ------  ---------  -----------  -------------  -------------
A               6    79.1    3.30817      1.35056        75.6283        82.5717
B               6    89.4    3.29059      1.34338        85.9467        92.8533



The p-value obtained from the t-test is significant (p < 0.05), and therefore we conclude that the yield of genotype A is significantly different from that of genotype B.
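
The headline numbers in the summary can be rebuilt from the per-group statistics in the "Parameter estimates" table. The sketch below is a hedged illustration of the pooled-variance calculation. The critical value 2.228 is an assumption taken from a standard t table for 10 degrees of freedom (two-sided, 95%), not something computed by the code.

```python
import math

# Per-group statistics copied from the "Parameter estimates" table above
n_a, mean_a, sd_a = 6, 79.1, 3.30817
n_b, mean_b, sd_b = 6, 89.4, 3.29059

dof = n_a + n_b - 2
# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / dof
std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
mean_diff = mean_a - mean_b
t = mean_diff / std_err

# Assumption: two-sided 95% critical value of Student's t for dof = 10, from a t table
t_crit = 2.228
lower = mean_diff - t_crit * std_err
upper = mean_diff + t_crit * std_err
print('t=%.3f, se=%.3f, 95%% CI=(%.2f, %.2f)' % (t, std_err, lower, upper))
```

Up to rounding, the results should match the `t`, `Std Error`, and 95% bounds in the summary above.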

# Paired t-test (dependent t-test)

The paired t-test is used to compare the difference between a pair of dependent measurements taken on the same subject.
For example, we have plant variety A and would like to compare the yield of A before and after the application of some fertilizer.

Note: The paired t-test is a one-sample t-test on the differences between the two dependent variables.

Paired t-test Hypotheses
Null hypothesis: There is no difference between the two dependent variables (difference=0)
Alternative hypothesis: There is a difference between the two dependent variables (two-tailed or two-sided)
Alternative hypothesis: The difference between the two dependent variables is either greater than or less than zero (one-tailed or one-sided)



Paired t-test Assumptions
Differences between the two dependent variables follow an approximately normal distribution
Differences between the two dependent variables should not have outliers
Observations are sampled independently from each other



In [18]:
from bioinfokit.analys import get_data, stat


In [19]:
df = get_data('t_pair').data
df.head()

Unnamed: 0,BF,AF
0,44.41,47.99
1,46.29,56.64
2,45.98,48.9
3,43.35,49.01
4,45.75,48.41


In [20]:
res = stat()
res.ttest(df=df, res=['AF', 'BF'], test_type=3)

In [21]:
print(res.summary)


Paired t-test 

------------------  ------------
Sample size         65
Difference Mean      5.55262
t                   14.2173
Df                  64
P-value (one-tail)   8.87966e-22
P-value (two-tail)   1.77593e-21
Lower 95.0%          4.7724
Upper 95.0%          6.33283
------------------  ------------


The p value obtained from the t-test is significant (p < 0.05), and therefore, we conclude that the yield of plant variety A significantly increased by the application of fertilizer.

# Z-test

In [1]:

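import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

mean_iq = 110                 # mean used to simulate the sample
sd_iq = 15 / math.sqrt(50)    # spread used to simulate the sample (15 scaled by sqrt(50))
alpha = 0.05                  # significance level
null_mean = 100               # population mean under the null hypothesis
# 50 simulated IQ scores; no seed is set, so the values vary from run to run
data = sd_iq * randn(50) + mean_iq

print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))

# One-sided test: H1 is that the population mean is larger than null_mean
ztest_score, p_value = ztest(data, value=null_mean, alternative='larger')

if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")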

mean=109.76 stdv=2.22
Reject Null Hypothesis
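
The decision above can be sanity-checked by hand from the printed mean and standard deviation. The sketch below is a minimal illustration of a one-sided z-test, assuming the usual formula z = (sample mean − null mean) / (sd / √n) with the p-value taken from the upper tail of the standard normal. The helper name `z_test_larger` is hypothetical. The result is only approximate: the printed values are rounded, and `np.std` divides by n while `ztest` uses n − 1 by default.

```python
import math

def z_test_larger(sample_mean, sample_sd, n, null_mean):
    """One-sided z-test (H1: mean > null_mean). Returns (z, p_value)."""
    std_err = sample_sd / math.sqrt(n)
    z = (sample_mean - null_mean) / std_err
    # Upper-tail probability of the standard normal via the complementary error function
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Mean and standard deviation as printed in the output above, with n = 50
z, p_value = z_test_larger(109.76, 2.22, 50, 100)
print('z=%.2f, reject H0: %s' % (z, p_value < 0.05))
```

A z score this large corresponds to a p-value that is effectively zero, which is why the null hypothesis is rejected.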
