# Lecture 6 - Hypothesis Testing - Part 2

## One Sample T Test

We use a one sample T-test to determine whether a sample mean (the observed average) differs significantly from a known or hypothesized population mean (the expected average).



### Example: We measure the resting systolic blood pressure of 20 first-year resident female doctors and compare the sample mean with the general population mean of 120 mmHg.
The Python cell below implements the test.

#### The Null hypothesis would be:- 
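“There is no significant difference between the mean resting systolic blood pressure of the resident female doctors and the general population mean of 120 mmHg”.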

#### The alternate hypothesis would be:- 
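“There is a statistically significant difference between the mean resting systolic blood pressure of the resident female doctors and the general population mean of 120 mmHg”.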


In [29]:
from scipy import stats

In [30]:
from scipy.stats import ttest_1samp  
import numpy as np  
  
# Creating a sample of resting systolic blood pressures (mmHg)  
female_doctor_bps = [128, 127, 118, 115, 144, 142, 133, 140, 132, 131, 
                     111, 132, 149, 122, 139, 119, 136, 129, 126, 128]
  
# Calculating the mean of the sample  
mean = np.mean(female_doctor_bps)  
print(mean)  
  
# Performing the T-Test   
t_test, p_val = ttest_1samp(female_doctor_bps, 120)  
print("P-value is: ", p_val)  
  
# taking the threshold value as 0.05 or 5%  
if p_val < 0.05:      
    print(" We can reject the null hypothesis")  
else:  
    print("We fail to reject the null hypothesis")  

130.05
P-value is:  0.00023838063630967753
 We can reject the null hypothesis
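
The statistic that `ttest_1samp` reports can be reproduced by hand, which makes the link between the sample mean, the standard error and the p-value explicit. The following is a minimal sketch, assuming the same 20 readings and the 120 mmHg reference value used above; the names `mu0`, `t_manual` and `p_manual` are chosen here purely for illustration.

```python
import numpy as np
from scipy import stats

female_doctor_bps = [128, 127, 118, 115, 144, 142, 133, 140, 132, 131,
                     111, 132, 149, 122, 139, 119, 136, 129, 126, 128]
mu0 = 120  # population mean under the null hypothesis (mmHg)

n = len(female_doctor_bps)
sample_mean = np.mean(female_doctor_bps)
sample_std = np.std(female_doctor_bps, ddof=1)  # sample standard deviation (divides by n - 1)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_manual = (sample_mean - mu0) / (sample_std / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against scipy's implementation
t_scipy, p_scipy = stats.ttest_1samp(female_doctor_bps, mu0)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

Both lines should print the same pair of values, since the manual formula is the standard definition of the one sample t-statistic.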


"There is a statistically significant difference between the resting systolic blood pressure of the resident female doctors and the general population".

## Two Sample T Test

A two sample T-test is used to compare the means of two independent samples.

### Example: Is there a difference in mean blood pressure between two groups of doctors?

#### The Null hypothesis would be:- 
“There is no significant difference between the blood pressures of male consultant doctors and junior resident female doctors”.

#### The alternate hypothesis would be:- 
“There is a statistically significant difference between the blood pressures of male consultant doctors and junior resident female doctors”.

In [31]:
from scipy.stats import ttest_ind  
import numpy as np  
  
# Creating the data groups  
data_group1 = np.array([128, 127, 118, 115, 144, 142, 133, 140, 132, 131, 
                     111, 132, 149, 122, 139, 119, 136, 129, 126, 128])  
data_group2 = np.array([118, 115, 112, 120, 124, 130, 123, 110, 120, 121,
                      123, 125, 129, 130, 112, 117, 119, 120, 123, 128])  
  
# Calculating the mean of the two data groups  
mean1 = np.mean(data_group1)  
mean2 = np.mean(data_group2)  
  
# Print mean values  
print("Data group 1 mean value:", mean1)  
print("Data group 2 mean value:", mean2)  
  
# Calculating standard deviation (np.std defaults to ddof=0, the population formula)  
std1 = np.std(data_group1)  
std2 = np.std(data_group2)  
  
# Printing standard deviation values  
print("Data group 1 std value:", std1)  
print("Data group 2 std value:", std2)  
  
# Implementing the t-test  
t_test,p_val = ttest_ind(data_group1, data_group2)  
print("The P-value is: ", p_val)  
  
# taking the threshold value as 0.05 or 5%  
if p_val < 0.05:      
    print("We can reject the null hypothesis")  
else:  
    print("We fail to reject the null hypothesis")  

Data group 1 mean value: 130.05
Data group 2 mean value: 120.95
Data group 1 std value: 9.708115162069308
Data group 2 std value: 5.757386559889825
The P-value is:  0.0011571376404026158
We can reject the null hypothesis
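
By default `ttest_ind` assumes that the two populations have equal variances. The two groups above have noticeably different standard deviations (about 9.7 versus 5.8), so it can be worth checking that the conclusion also holds under Welch's t-test, which drops that assumption. A minimal sketch, assuming the same two data groups; `equal_var=False` is the `scipy` switch that selects Welch's variant.

```python
import numpy as np
from scipy.stats import ttest_ind

data_group1 = np.array([128, 127, 118, 115, 144, 142, 133, 140, 132, 131,
                        111, 132, 149, 122, 139, 119, 136, 129, 126, 128])
data_group2 = np.array([118, 115, 112, 120, 124, 130, 123, 110, 120, 121,
                        123, 125, 129, 130, 112, 117, 119, 120, 123, 128])

# Student's t-test: pools the two sample variances (scipy default)
t_pooled, p_pooled = ttest_ind(data_group1, data_group2)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(data_group1, data_group2, equal_var=False)

print("Pooled p-value:", p_pooled)
print("Welch p-value: ", p_welch)
```

Because both groups have the same size, the two t-statistics coincide and only the degrees of freedom differ. Welch's p-value is slightly larger but still below 0.05, so the decision to reject the null hypothesis is unchanged.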


## Paired Sample T Test

When we want to compare two related samples, for example measurements taken on the same subjects before and after a treatment, we use a paired T-test.

### Example: We measure the amount of sleep patients get before and after taking a soporific drug intended to help them sleep.

* The null hypothesis is that the soporific drug has no effect on the sleep duration of the patients.
* The alternate hypothesis is that the soporific drug changes the sleep duration of the patients.



In [32]:
control = [8.0, 7.1, 6.5, 6.7, 7.2, 5.4, 4.7, 8.1, 6.3, 4.8]
treatment = [9.9, 7.9, 7.6, 6.8, 7.1, 9.9, 10.5, 9.7, 10.9, 8.2]

stats.ttest_rel(control, treatment)

Ttest_relResult(statistic=-3.6244859951782136, pvalue=0.0055329408161001415)

Our t-statistic is -3.624 and, together with the degrees of freedom (9, i.e. n - 1 for 10 pairs), it is used to calculate the p-value.

The p-value is 0.0055, which is below both of the standard thresholds, 0.05 and 0.01, so we reject the null hypothesis and conclude that there is a statistically significant difference in sleep duration after taking the soporific drug.
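
A paired T-test is equivalent to a one sample T-test on the per-patient differences, tested against a mean difference of 0. The sketch below illustrates this under the assumption that the `control` and `treatment` lists hold measurements for the same 10 patients in the same order; `diff`, `t_manual` and `p_manual` are illustrative names.

```python
import numpy as np
from scipy import stats

control = [8.0, 7.1, 6.5, 6.7, 7.2, 5.4, 4.7, 8.1, 6.3, 4.8]
treatment = [9.9, 7.9, 7.6, 6.8, 7.1, 9.9, 10.5, 9.7, 10.9, 8.2]

# Per-patient difference (control minus treatment, matching ttest_rel's sign)
diff = np.array(control) - np.array(treatment)
n = len(diff)
dof = n - 1  # degrees of freedom for a paired test

# t = mean difference / standard error of the differences
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=dof)

# The same result from scipy, two ways
t_rel, p_rel = stats.ttest_rel(control, treatment)
t_one, p_one = stats.ttest_1samp(diff, 0)

print("manual:     ", t_manual, p_manual)
print("ttest_rel:  ", t_rel, p_rel)
print("ttest_1samp:", t_one, p_one)
```

All three lines should agree, which shows where the 9 degrees of freedom and the p-value quoted above come from.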

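## One Sample Z Test

A one sample Z-test checks whether a sample mean differs from a hypothesized population mean. It relies on the normal distribution, so it is typically used with large samples.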

In [33]:
import pandas as pd

In [34]:
#Reading the dataset
data = pd.read_csv('Downloads/housing.csv')


In [35]:
data.head()

| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price | Address |
|---|---|---|---|---|---|---|---|
| 0 | 79545.45857 | 5.682861 | 7.009188 | 4.09 | 23086.8005 | 1059034.0 | 208 Michael Ferry Apt. 674\nLaurabury, NE 3701... |
| 1 | 79248.64245 | 6.0029 | 6.730821 | 3.09 | 40173.07217 | 1505891.0 | 188 Johnson Views Suite 079\nLake Kathleen, CA... |
| 2 | 61287.06718 | 5.86589 | 8.512727 | 5.13 | 36882.1594 | 1058988.0 | 9127 Elizabeth Stravenue\nDanieltown, WI 06482... |
| 3 | 63345.24005 | 7.188236 | 5.586729 | 3.26 | 34310.24283 | 1260617.0 | USS Barnett\nFPO AP 44820 |
| 4 | 59982.19723 | 5.040555 | 7.839388 | 4.23 | 26354.10947 | 630943.5 | USNS Raymond\nFPO AE 09386 |


In [36]:
data.columns

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')

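### Example: We test whether the mean house price is 1.232073e+06

#### Null Hypothesis:
The mean house price is 1.232073e+06

#### Alternate Hypothesis:
The mean house price is not 1.232073e+06

Note that 1.232073e+06 is the (rounded) sample mean of `Price` reported by `data.mean()`, so a z-statistic near 0 and a p-value near 1 are expected in this example.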

In [37]:
data.mean(numeric_only=True)


Avg. Area Income                6.858311e+04
Avg. Area House Age             5.977222e+00
Avg. Area Number of Rooms       6.987792e+00
Avg. Area Number of Bedrooms    3.981330e+00
Area Population                 3.616352e+04
Price                           1.232073e+06
dtype: float64

In [38]:
from statsmodels.stats import weightstats as stests

#from scipy import stats

ztest ,pval = stests.ztest(x1 = data['Price'], x2=None, value=1.232073e+06)
print("P Value :",float(pval))
print("z-test value :",float(ztest))

if pval<0.05:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")

P Value : 0.9999447414673301
z-test value : -6.925630025777024e-05
We fail to reject the null hypothesis


## Two Sample Z Test

### Example 1: Testing whether the means of two groups are equal

* H0: the difference between the means of the two data groups is 0

* H1: the difference between the means of the two data groups is not 0

In [39]:
# Importing the required libraries  
import pandas as pd  
from scipy import stats  
from statsmodels.stats import weightstats as stests  
  
# Creating a dataset  
data1 = [83, 85, 86, 90, 90, 93, 93, 95, 97, 97,  
         106, 108, 106, 108, 111, 113, 113, 112, 116, 111]  
  
data2 = [92, 92, 90, 93, 93, 97, 94, 98, 109, 108,  
         110, 117, 110, 115, 114, 114, 130, 130, 149, 131]  
  
# Implementing the two-sample z-test   
z_test ,p_val = stests.ztest(data1, x2 = data2, value = 0, alternative = 'two-sided')  
print(p_val)  
  
# taking the threshold value as 0.05 or 5%  
if p_val < 0.05:  
    print("We can reject the null hypothesis")  
else:  
    print("We fail to reject the null hypothesis")  

0.04813782199434202
We can reject the null hypothesis
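
The z-statistic behind this result can be written out directly: the difference in sample means divided by its standard error, with the p-value taken from the standard normal distribution. The sketch below uses only `numpy` and `scipy` and a pooled variance estimate. That this matches the default behaviour of `stests.ztest` is an assumption here, although it does reproduce the p-value printed above.

```python
import numpy as np
from scipy.stats import norm

data1 = [83, 85, 86, 90, 90, 93, 93, 95, 97, 97,
         106, 108, 106, 108, 111, 113, 113, 112, 116, 111]
data2 = [92, 92, 90, 93, 93, 97, 94, 98, 109, 108,
         110, 117, 110, 115, 114, 114, 130, 130, 149, 131]

x1 = np.asarray(data1, dtype=float)
x2 = np.asarray(data2, dtype=float)
n1, n2 = len(x1), len(x2)

# Pooled sample variance (assumes both groups share one variance)
pooled_var = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)

# Standard error of the difference in means
std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# z = (observed difference - hypothesized difference of 0) / standard error
z_manual = (x1.mean() - x2.mean() - 0) / std_err

# Two-sided p-value from the standard normal distribution
p_manual = 2 * norm.sf(abs(z_manual))

print("z-statistic:", z_manual)
print("p-value:    ", p_manual)
```

The p-value sits only just below 0.05, so the rejection here is borderline and sensitive to the choice of threshold.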


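### Example 2: Comparing the means of Price and Avg. Area House Age

A two sample Z-test compares the means of the two columns; it does not measure association between them. Because `Price` and `Avg. Area House Age` are recorded in different units and on very different scales, rejecting the null hypothesis of equal means is expected here and says nothing about whether the two variables are related. A correlation test (for example `scipy.stats.pearsonr`) is the usual tool for that question.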

In [40]:
zstats, pval = stests.ztest(data['Price'], data['Avg. Area House Age'], value = 0)
print("\np-value",pval)
print("z-score ",zstats)

if pval <0.05:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")


p-value 0.0
z-score  246.71742120369345
we reject the null hypothesis


## One-Way ANOVA Test

### Example: A one-way ANOVA uses the F-statistic to determine whether the means of two or more data groups are equal.

* H0: all group means are equal

* H1: at least one group mean differs from the others

In [41]:
# Importing the required libraries  
import scipy.stats  
  
# Creating sample data  
data1 = [0.0842, 0.0368, 0.0847, 0.0935, 0.0376, 0.0963, 0.0684,  
             0.0758, 0.0854, 0.0855]  
data2 = [0.0785, 0.0845, 0.0758, 0.0853, 0.0946, 0.0785, 0.0853,  
           0.0685]  
data3 = [0.0864, 0.2522, 0.0894, 0.2724, 0.0853, 0.1367, 0.853]  
  
# Performing the F-Test   
f_test, p_val = scipy.stats.f_oneway(data1, data2, data3)  
print("p-value is: ", p_val)  
  
# taking the threshold value as 0.05 or 5%  
if p_val < 0.05:      
    print(" We can reject the null hypothesis")  
else:  
    print("We fail to reject the null hypothesis")  

p-value is:  0.04043792126789144
 We can reject the null hypothesis
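
The F-statistic returned by `f_oneway` is the ratio of the variance between the groups to the variance within them. The following sketch recomputes it from the sums of squares, assuming the same three data groups as above; the names `ss_between`, `ss_within`, `f_manual` and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

data1 = [0.0842, 0.0368, 0.0847, 0.0935, 0.0376, 0.0963, 0.0684,
         0.0758, 0.0854, 0.0855]
data2 = [0.0785, 0.0845, 0.0758, 0.0853, 0.0946, 0.0785, 0.0853,
         0.0685]
data3 = [0.0864, 0.2522, 0.0894, 0.2724, 0.0853, 0.1367, 0.853]

groups = [np.array(data1), np.array(data2), np.array(data3)]
k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n_total - k

# F = mean square between / mean square within
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against scipy's implementation
f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```

A significant F-test only indicates that at least one group mean differs; it does not say which one.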


## Chi-squared Test

This test is used when two categorical variables are measured on the same population. Its purpose is to decide whether the two variables are significantly associated.

### Example

For example, in an election campaign survey we may group people by gender (male or female) and by voting preference (Democratic, Republican, or Independent). To determine whether gender is associated with voting preference, we can apply a chi-square test of independence.

In [42]:
# Importing the required modules  
from scipy.stats import chi2_contingency  
    
# defining our data (rows: gender, columns: voting preference, following the example above)  
data = [[231, 256, 321], [245, 312, 213]]  
  
# Performing chi-square test  
test, p_val, dof, expected_val = chi2_contingency(data)  
    
# interpreting the p-value  
alpha = 0.05  
print("The p-value of our test is " + str(p_val))  
  
# Checking the hypothesis  
if p_val <= alpha:  
    print('We can reject the null hypothesis')  
else:  
    print('We fail to reject the null hypothesis')  

The p-value of our test is 1.4585823594475804e-06
We can reject the null hypothesis
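
The chi-square statistic compares the observed counts with the counts expected if the two variables were independent. The sketch below rebuilds the expected table, the statistic and the p-value by hand for the same 2 x 3 table, then checks them against `chi2_contingency`; `observed`, `chi2_manual` and `p_manual` are illustrative names.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[231, 256, 321],
                     [245, 312, 213]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / grand_total

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_manual, dof)

# Cross-check against scipy's implementation
stat, p_val, dof_scipy, expected_scipy = chi2_contingency(observed)
print("manual:", chi2_manual, p_manual, dof)
print("scipy: ", stat, p_val, dof_scipy)
```

The two results agree because `chi2_contingency` applies its continuity correction only when there is a single degree of freedom, which is not the case for this table.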
