# Hypothesis Testing
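Hypothesis testing is a statistical method for making decisions about a population from sample (experimental) data. A hypothesis is an assumption about a population parameter. The test checks whether the data are consistent with a default assumption, the null hypothesis, and rejects it when the p-value falls below a chosen significance level (alpha, 0.05 throughout this notebook).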

Examples: "the class average is 40" or "boys are, on average, taller than girls".

Types of hypothesis tests:
1. T Test [3 types]
   1. One-sample t-test
   2. Two-sample t-test
   3. Paired-samples t-test
2. Z Test
3. ANOVA Test
4. Chi-Square Test

## T Test

### One sample T-test

In [3]:
from scipy.stats import ttest_1samp
import numpy as np

In [5]:
ages = [23, 35, 40, 57, 30, 49, 52, 22, 31, 45]
print("Ages: ", ages)

Ages:  [23, 35, 40, 57, 30, 49, 52, 22, 31, 45]


In [6]:
ages_mean = np.mean(ages)
print("Mean of Ages: ", ages_mean)

Mean of Ages:  38.4


In [7]:
# H0: the population mean age is 30
t_stat, pval = ttest_1samp(ages, 30)
print("p-value: ", pval)

p-value:  0.0568819292935687


In [8]:
if pval < 0.05:    # alpha = 0.05 or 5%
    print("We reject the null hypothesis ..!")
else:
    print("We fail to reject the null hypothesis ..!")

We fail to reject the null hypothesis ..!
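
To make the call above less of a black box, here is a minimal sketch that recomputes the one-sample t statistic by hand and compares it with `ttest_1samp`. The variable names (`mu0`, `t_manual`, `p_manual`) are illustrative choices, not part of the original notebook.

```python
import numpy as np
from scipy import stats

ages = [23, 35, 40, 57, 30, 49, 52, 22, 31, 45]
mu0 = 30  # population mean under the null hypothesis

n = len(ages)
sample_mean = np.mean(ages)
sample_std = np.std(ages, ddof=1)  # sample standard deviation (n - 1 denominator)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_manual = (sample_mean - mu0) / (sample_std / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(ages, mu0)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

Both routes should give the same p-value as the cell above.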


### Two-sample T-test

In [9]:
from scipy.stats import ttest_ind
import numpy as np

In [11]:
# No random seed is set, so these samples (and every result below) change on each run
week1 = np.random.randint(15, 40, size = 50)
week2 = np.random.randint(15, 40, size = 50)

print("Week 1: ", week1)
print("Week 2: ", week2)

Week 1:  [27 33 31 35 31 34 32 23 30 21 19 21 19 24 39 15 25 30 37 15 38 28 28 25
 19 33 33 31 30 33 23 25 24 28 24 31 18 37 37 36 20 31 23 20 22 30 17 16
 25 28]
Week 2:  [31 23 16 25 35 24 20 37 35 25 35 38 18 18 24 23 24 22 22 26 24 16 27 16
 31 38 17 35 15 16 26 28 22 17 39 31 34 35 37 28 20 20 39 16 32 37 22 20
 35 34]


In [12]:
week1_mean = np.mean(week1)
week2_mean = np.mean(week2)

print("Week 1 mean value: ", week1_mean)
print("Week 2 mean value: ", week2_mean)

Week 1 mean value:  27.08
Week 2 mean value:  26.56


In [13]:
# np.std defaults to ddof=0 (population formula); pass ddof=1 for the sample standard deviation
week1_std = np.std(week1)
week2_std = np.std(week2)

print("Week 1 std value: ", week1_std)
print("Week 2 std value: ", week2_std)

Week 1 std value:  6.504890467947942
Week 2 std value:  7.566135076774667


In [14]:
ttest, pval = ttest_ind(week1, week2)
print("p-value: ", pval)

p-value:  0.7160441823485384


In [15]:
if pval < 0.05:
    print("We reject null hypothesis ..!")
else:
    print("We fail to reject null hypothesis ..!")

We fail to reject null hypothesis ..!
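
The cell above uses the default `ttest_ind`, which assumes both groups share one variance. As a hedged sketch, the snippet below repeats the comparison with a seeded generator (the seed `42` is an arbitrary assumption, so the numbers will not match the output above) and also runs Welch's t-test via `equal_var=False`, which drops the equal-variance assumption.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)          # fixed, arbitrary seed so reruns give the same samples
week1 = rng.integers(15, 40, size=50)    # integers in [15, 40), like np.random.randint(15, 40)
week2 = rng.integers(15, 40, size=50)

# Student's t-test: pooled variance, assumes equal population variances
t_pooled, p_pooled = ttest_ind(week1, week2)
# Welch's t-test: separate variances, adjusted degrees of freedom
t_welch, p_welch = ttest_ind(week1, week2, equal_var=False)

print("pooled:", t_pooled, p_pooled)
print("Welch: ", t_welch, p_welch)
```

With equal group sizes the two t statistics coincide and only the degrees of freedom (and hence the p-values) can differ.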


### Paired-samples t-test

In [16]:
import pandas as pd
from scipy import stats

In [17]:
df = pd.read_csv("dataset/blood_pressure.csv")
df.head()

|   | patient | sex | agegrp | bp_before | bp_after |
|---|---|---|---|---|---|
| 0 | 1 | Male | 30-45 | 143 | 153 |
| 1 | 2 | Male | 30-45 | 163 | 170 |
| 2 | 3 | Male | 30-45 | 153 | 168 |
| 3 | 4 | Male | 30-45 | 153 | 142 |
| 4 | 5 | Male | 30-45 | 146 | 141 |


In [18]:
df[["bp_before", "bp_after"]].describe()

|   | bp_before | bp_after |
|---|---|---|
| count | 120.0 | 120.0 |
| mean | 156.45 | 151.358333 |
| std | 11.389845 | 14.177622 |
| min | 138.0 | 125.0 |
| 25% | 147.0 | 140.75 |
| 50% | 154.5 | 149.5 |
| 75% | 164.0 | 161.0 |
| max | 185.0 | 185.0 |


In [19]:
ttest, pval = stats.ttest_rel(df["bp_before"], df["bp_after"])
print("p-value: ", pval)

p-value:  0.0011297914644840823


In [20]:
if pval < 0.05:
    print("Reject null hypothesis ..!")
else:
    print("Fail to reject null hypothesis ..!")

Reject null hypothesis ..!
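
A paired t-test is equivalent to a one-sample t-test on the per-subject differences. The sketch below illustrates this with synthetic readings, because the CSV is not bundled here. The arrays `before` and `after` and the distribution parameters are hypothetical stand-ins, not the blood-pressure data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)                 # arbitrary seed for reproducibility
before = rng.normal(155, 11, size=30)          # hypothetical "before" readings
after = before - rng.normal(5, 8, size=30)     # hypothetical "after" readings, same subjects

# Paired test on the two columns
t_rel, p_rel = stats.ttest_rel(before, after)
# One-sample test on the differences against a mean difference of 0
t_one, p_one = stats.ttest_1samp(before - after, 0)

print("ttest_rel:   ", t_rel, p_rel)
print("ttest_1samp: ", t_one, p_one)
```

Pairing removes between-subject variation, which is why it is preferred over an independent two-sample test when each subject is measured twice.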


## Z Test

### One-sample Z Test

In [21]:
import pandas as pd
from scipy import stats
from statsmodels.stats import weightstats as stests

In [23]:
ztest, pval = stests.ztest(df["bp_before"], x2 = None, value = 156)
print("p-value: ", float(pval))

p-value:  0.6651614730255063


In [24]:
if pval < 0.05:
    print("reject null hypothesis ..!")
else:
    print("fail to reject null hypothesis ..!")

fail to reject null hypothesis ..!
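
For intuition on how the z-test relates to the t-test, the sketch below computes one statistic from a synthetic sample and looks up its two-sided p-value under both the normal and the t distribution. The sample `readings` is a hypothetical stand-in for `bp_before`, and the sample standard deviation is used in place of a known population value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)               # arbitrary seed for reproducibility
readings = rng.normal(156, 11, size=120)     # hypothetical sample, not the CSV data
value = 156                                  # mean under the null hypothesis

# Standard error of the mean, using the sample standard deviation
se = readings.std(ddof=1) / np.sqrt(readings.size)
z = (readings.mean() - value) / se

p_z = 2 * stats.norm.sf(abs(z))                       # normal reference distribution
p_t = 2 * stats.t.sf(abs(z), df=readings.size - 1)    # t reference distribution

print("z statistic:", z)
print("p (normal):", p_z, " p (t):", p_t)
```

With 120 observations the two p-values are nearly identical, since the t distribution approaches the normal as the sample grows.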


### Two-sample Z Test

In [25]:
ztest, pval = stests.ztest(df["bp_before"],
                           x2 = df["bp_after"],
                           value = 0,
                           alternative = "two-sided")

print("p-value: ", float(pval))

p-value:  0.002162306611369422


In [26]:
if pval < 0.05:
    print("reject null hypothesis ..!")
else:
    print("fail to reject null hypothesis ..!")

reject null hypothesis ..!
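
The following is a minimal sketch of the textbook two-sample z statistic with separate (unpooled) variances. `two_sample_z` is a hypothetical helper and the data are synthetic. statsmodels' `ztest` defaults to a pooled variance estimate (`usevar="pooled"`), so its statistic can differ slightly from this form.

```python
import numpy as np
from scipy.stats import norm

def two_sample_z(x1, x2, value=0.0):
    """Unpooled two-sample z-test (hypothetical helper); returns (z, two-sided p-value)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Standard error of the difference between the two sample means
    se = np.sqrt(x1.var(ddof=1) / x1.size + x2.var(ddof=1) / x2.size)
    z = (x1.mean() - x2.mean() - value) / se
    return z, 2 * norm.sf(abs(z))

rng = np.random.default_rng(2)             # arbitrary seed for reproducibility
group_a = rng.normal(156, 11, size=120)    # hypothetical independent samples
group_b = rng.normal(151, 14, size=120)

z, p = two_sample_z(group_a, group_b)
print("z statistic:", z, " p-value:", p)
```

This form treats the two samples as independent. Measurements taken on the same subjects are better served by a paired test.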


## ANOVA (F-Test) - Analysis of variance

### One Way F-Test (Anova)

In [29]:
df_anova = pd.read_csv("dataset/PlantGrowth.csv")
df_anova = df_anova[["weight", "group"]]
df_anova.head(10)

|   | weight | group |
|---|---|---|
| 0 | 4.17 | ctrl |
| 1 | 5.58 | ctrl |
| 2 | 5.18 | ctrl |
| 3 | 6.11 | ctrl |
| 4 | 4.5 | ctrl |
| 5 | 4.61 | ctrl |
| 6 | 5.17 | ctrl |
| 7 | 4.53 | ctrl |
| 8 | 5.33 | ctrl |
| 9 | 5.14 | ctrl |


In [30]:
grps = pd.unique(df_anova.group.values)

d_data = {grp: df_anova["weight"][df_anova.group == grp] for grp in grps}
d_data

{'ctrl': 0    4.17
 1    5.58
 2    5.18
 3    6.11
 4    4.50
 5    4.61
 6    5.17
 7    4.53
 8    5.33
 9    5.14
 Name: weight, dtype: float64,
 'trt1': 10    4.81
 11    4.17
 12    4.41
 13    3.59
 14    5.87
 15    3.83
 16    6.03
 17    4.89
 18    4.32
 19    4.69
 Name: weight, dtype: float64,
 'trt2': 20    6.31
 21    5.12
 22    5.54
 23    5.50
 24    5.37
 25    5.29
 26    4.92
 27    6.15
 28    5.80
 29    5.26
 Name: weight, dtype: float64}

In [31]:
F, p = stats.f_oneway(d_data["ctrl"], d_data["trt1"], d_data["trt2"])

print("p-value for significance is: ", p)

p-value for significance is:  0.0159099583256229


In [32]:
if p < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

reject null hypothesis
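
To show what `f_oneway` computes, this sketch rebuilds the F statistic from between-group and within-group sums of squares, using the plant weights printed above. The dictionary `groups` is typed in by hand from that output, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Weights copied from the d_data output above
groups = {
    "ctrl": [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14],
    "trt1": [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69],
    "trt2": [6.31, 5.12, 5.54, 5.50, 5.37, 5.29, 4.92, 6.15, 5.80, 5.26],
}
samples = [np.asarray(values) for values in groups.values()]
all_values = np.concatenate(samples)
grand_mean = all_values.mean()
k, n = len(samples), all_values.size   # number of groups, total observations

# Variation of the group means around the grand mean
ss_between = sum(s.size * (s.mean() - grand_mean) ** 2 for s in samples)
# Variation of the observations around their own group mean
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*samples)
print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```

A significant F only says that at least one group mean differs. It does not say which one.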


### Two Way F-Test

In [33]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [34]:
df_anova2 = pd.read_csv("https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master/crop_yield.csv")

In [35]:
df_anova2.head(10)

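|   | Fert | Water | Yield |
|---|---|---|---|
| 0 | A | High | 27.4 |
| 1 | A | High | 33.6 |
| 2 | A | High | 29.8 |
| 3 | A | High | 35.2 |
| 4 | A | High | 33.0 |
| 5 | B | High | 34.8 |
| 6 | B | High | 27.0 |
| 7 | B | High | 30.2 |
| 8 | B | High | 30.8 |
| 9 | B | High | 26.4 |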


In [36]:
model = ols("Yield ~ C(Fert)*C(Water)", df_anova2).fit()

In [37]:
print(f"Overall model F({model.df_model:.0f}, {model.df_resid:.0f}) = {model.fvalue:.3f}, p = {model.f_pvalue:.4f}")

Overall model F(3, 16) = 4.112, p = 0.0243


In [38]:
res = sm.stats.anova_lm(model, typ = 2)
res

|   | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(Fert) | 69.192 | 1.0 | 5.766 | 0.028847 |
| C(Water) | 63.368 | 1.0 | 5.280667 | 0.035386 |
| C(Fert):C(Water) | 15.488 | 1.0 | 1.290667 | 0.272656 |
| Residual | 192.0 | 16.0 | NaN | NaN |
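
Each F value in the ANOVA table is the term's mean square divided by the residual mean square, and each p-value is the upper tail of the F distribution. The sketch below reproduces the `F` and `PR(>F)` columns from the `sum_sq` and `df` columns above. The dictionary names are illustrative.

```python
from scipy import stats

# Sums of squares and degrees of freedom copied from the table above
sum_sq = {"C(Fert)": 69.192, "C(Water)": 63.368, "C(Fert):C(Water)": 15.488}
df_term = 1
ss_resid, df_resid = 192.0, 16

ms_resid = ss_resid / df_resid   # residual mean square

results = {}
for term, ss in sum_sq.items():
    f_value = (ss / df_term) / ms_resid
    p_value = stats.f.sf(f_value, df_term, df_resid)   # upper-tail probability
    results[term] = (f_value, p_value)
    print(f"{term}: F = {f_value:.6f}, p = {p_value:.6f}")
```

At alpha = 0.05 the two main effects are significant, while the interaction term (p = 0.272656 in the table) is not.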


## Chi-Square Test

In [40]:
from scipy.stats import chi2

df_chi = pd.read_csv("dataset/chi-test.csv")
contingency_table = pd.crosstab(df_chi["Gender"], df_chi["Like Shopping?"])
print("contingency_table :-\n", contingency_table)

# Observed counts, and the counts expected if the two variables were independent
observed_values = contingency_table.values
print("Observed Values :-\n", observed_values)
expected_values = stats.chi2_contingency(contingency_table)[3]
print("Expected Values :-\n", expected_values)

# Degrees of freedom = (rows - 1) * (columns - 1)
no_of_rows, no_of_columns = contingency_table.shape
dof = (no_of_rows - 1) * (no_of_columns - 1)
print("Degree of Freedom:-", dof)

alpha = 0.05

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_square = sum((o - e) ** 2.0 / e for o, e in zip(observed_values, expected_values))
chi_square_statistic = chi_square.sum()
print("chi-square statistic:-", chi_square_statistic)

critical_value = chi2.ppf(q=1 - alpha, df=dof)
print("critical_value:", critical_value)

p_value = 1 - chi2.cdf(x=chi_square_statistic, df=dof)
print("p-value:", p_value)
print("Significance level: ", alpha)

# Decision by critical value
if chi_square_statistic >= critical_value:
    print("Reject H0: there is a relationship between the 2 categorical variables")
else:
    print("Fail to reject H0: no evidence of a relationship between the 2 categorical variables")

# Decision by p-value (always agrees with the critical-value rule)
if p_value <= alpha:
    print("Reject H0: there is a relationship between the 2 categorical variables")
else:
    print("Fail to reject H0: no evidence of a relationship between the 2 categorical variables")

contingency_table :-
 Like Shopping?  No  Yes
Gender                 
Female           2    3
Male             2    2
Observed Values :-
 [[2 3]
 [2 2]]
Expected Values :-
 [[2.22222222 2.77777778]
 [1.77777778 2.22222222]]
Degree of Freedom:- 1
chi-square statistic:- 0.09000000000000008
critical_value: 3.841458820694124
p-value: 0.7641771556220945
Significance level:  0.05
Fail to reject H0: no evidence of a relationship between the 2 categorical variables
Fail to reject H0: no evidence of a relationship between the 2 categorical variables
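
The manual calculation can be cross-checked with a single call to `chi2_contingency`. This is a sketch in which the observed counts are typed in from the contingency table above. Passing `correction=False` matters here: by default SciPy applies Yates' continuity correction to 2x2 tables, which would give a different statistic from the hand computation.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table above (rows: Female, Male)
observed = np.array([[2, 3],
                     [2, 2]])

# correction=False gives the plain Pearson statistic, without Yates' correction
stat, p, dof, expected = chi2_contingency(observed, correction=False)

print("chi-square statistic:", stat)
print("p-value:", p)
print("degrees of freedom:", dof)
print("expected counts:\n", expected)
```

All expected counts here are below 5, so the chi-square approximation is rough for this table. `scipy.stats.fisher_exact` is a common alternative for small 2x2 tables.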
