<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Hypothesis-Testing" data-toc-modified-id="Hypothesis-Testing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Hypothesis Testing</a></span></li><li><span><a href="#One-sample-T-test" data-toc-modified-id="One-sample-T-test-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>One-sample T-test</a></span></li><li><span><a href="#Two-sample-T-test" data-toc-modified-id="Two-sample-T-test-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Two-sample T-test</a></span></li><li><span><a href="#Chi-Square-test" data-toc-modified-id="Chi-Square-test-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Chi-Square test</a></span></li></ul></div>

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind
from scipy.stats import chi2_contingency

# Hypothesis Testing

This project is a simple practice implementation of three different hypothesis tests:

* One-sample T-test
* Two-sample T-test
* Chi-Square contingency test

The data is for a company that sells two products - vein packs and artery packs - with the aim of extending the lives of those who use the products.

In case the packs do not extend users' lifespans, the company has planned a contingency. It has surveyed customers about their blood iron counts and categorised the responses as low, normal, or high. The distribution of responses will also be tested for significance.

The hypothesis tests are designed to check whether the product is actually achieving its goal, by comparing the lifespans of subscribers against a benchmark lifespan of 73 years.

It also aims to establish if there is a significant difference in results for those who use one product vs the other.

In [3]:
# Load datasets
lifespans = pd.read_csv('familiar_lifespan.csv')
iron = pd.read_csv('familiar_iron.csv')

In [4]:
lifespans.head()

Unnamed: 0,pack,lifespan
0,vein,76.25509
1,artery,76.404504
2,artery,75.952442
3,artery,76.923082
4,artery,73.771212


In [5]:
iron.head()

Unnamed: 0,pack,iron
0,vein,low
1,artery,normal
2,artery,normal
3,artery,normal
4,artery,high


In [6]:
# Save lifespans for vein pack subscribers
vein_pack_lifespans = lifespans.lifespan[lifespans.pack=='vein']

In [7]:
# Calculate average lifespan for vein pack
print(np.mean(vein_pack_lifespans))

76.16901335636044


# One-sample T-test

This test checks whether the mean lifespan of vein pack subscribers differs significantly from a benchmark lifespan of 73 years. The null hypothesis is that the true mean lifespan of vein pack subscribers is 73. As is typical, a p-value < 0.05 would suggest that the observed sample mean is unlikely under that null hypothesis, i.e. that the mean lifespan of vein pack subscribers is significantly different from 73.

In [10]:
# Run one-sample t-test against a hypothesised mean lifespan of 73 years
tstat, pval = ttest_1samp(vein_pack_lifespans, 73)
print(pval)

5.972157921433082e-07


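This is statistically significant - the mean lifespan of vein pack subscribers (about 76.2 years) differs significantly from 73. As the sample mean is higher than 73, the data suggests that vein pack subscribers live longer than 73 years on average.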
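
To make the mechanics of the one-sample test concrete, the sketch below computes the t statistic and two-sided p-value by hand and compares them with `ttest_1samp`. The `sample` array is synthetic, hypothetical data standing in for `vein_pack_lifespans` (the real CSV is not used here), so the printed numbers will not match the results above.

```python
import numpy as np
from scipy import stats

# Hypothetical sample standing in for vein_pack_lifespans (synthetic, not the real data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=76.0, scale=4.0, size=20)
popmean = 73  # hypothesised mean lifespan under the null hypothesis

# t = (sample mean - hypothesised mean) / standard error of the mean
n = sample.size
t_manual = (sample.mean() - popmean) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test via SciPy
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean)

print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The two pairs of values should agree, which shows that the p-value describes the sample mean relative to 73, not the probability of any single subscriber having a lifespan of 73.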

In [12]:
# Save lifespans for artery pack subscribers
artery_pack_lifespans = lifespans.lifespan[lifespans.pack=='artery']

In [13]:
# Calculate average lifespan for artery pack
print(np.mean(artery_pack_lifespans))

74.8736622351704


# Two-sample T-test

This test compares the mean lifespans of the two groups of subscribers to see if there is a significant difference between them. A p-value < 0.05 would suggest that there is a statistically significant difference in mean lifespan between the two groups of subscribers.

In [14]:
# run a two-sample t-test
tstat, pval = ttest_ind(vein_pack_lifespans, artery_pack_lifespans)
print(pval)

0.05588883079070819


This result is just above the 0.05 threshold, so it is not classed as statistically significant, i.e. the observed difference in mean lifespans could plausibly be due to random chance.

As the result is so close to the threshold, I would recommend repeating the test once more data has been collected.
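
With a borderline result like this, one possible sensitivity check is Welch's t-test, which does not assume the two groups share the same variance. The sketch below is not part of the original analysis; `group_a` and `group_b` are synthetic, hypothetical samples standing in for the two lifespan series. It uses the `equal_var=False` option of `ttest_ind`.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples standing in for the two groups (synthetic, not the real data)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=76.0, scale=4.0, size=20)
group_b = rng.normal(loc=75.0, scale=6.0, size=25)

# Student's t-test: assumes both groups have equal variance (the default)
t_student, p_student = ttest_ind(group_a, group_b)

# Welch's t-test: drops the equal-variance assumption
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print("Student:", t_student, p_student)
print("Welch:  ", t_welch, p_welch)
```

If the two p-values fall on opposite sides of 0.05, the conclusion depends on the variance assumption, which is another reason to collect more data before deciding.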

# Chi-Square test

For the survey asking for iron count levels there were:

* 200 vein pack responses - 70% low, 20% normal, 10% high
* 145 artery pack responses - 20% low, 20% normal, 60% high

The aim is to test whether this difference in responses is significant.

In [15]:
# Create contingency table
Xtab = pd.crosstab(iron.pack, iron.iron)
print(Xtab)

iron    high  low  normal
pack                     
artery    87   29      29
vein      20  140      40
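
The counts in the table can be turned into row percentages to compare the two groups directly. The sketch below rebuilds the table by hand from the counts printed above, since the CSV is not loaded here; `counts` and `row_pct` are illustrative names. On the real data, `pd.crosstab(iron.pack, iron.iron, normalize='index')` should give the same proportions.

```python
import pandas as pd

# Contingency table rebuilt by hand from the counts printed above
counts = pd.DataFrame(
    {"high": [87, 20], "low": [29, 140], "normal": [29, 40]},
    index=["artery", "vein"],
)

# Divide each row by its total to get the percentage of responses per iron level
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100
print(row_pct.round(1))
```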


In [17]:
# run a Chi-Square test
chi2, pval, dof, exp = chi2_contingency(Xtab)
print(pval)

9.359749337433008e-25


There is a significant difference in the distribution of responses between the vein pack subscribers and the artery pack subscribers. This is potentially dangerous - the reason for such large numbers of respondents in the vein pack group having low iron levels should be investigated.
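
To see which cells drive the result, the expected counts returned by `chi2_contingency` can be compared with the observed counts. The sketch below re-enters the observed table by hand from the counts printed above (columns in the order high, low, normal); the `expected_manual` calculation is added here only to illustrate how expected counts are derived under independence.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the table above: rows = artery, vein; columns = high, low, normal
observed = np.array([[87, 29, 29],
                     [20, 140, 40]])

chi2, pval, dof, expected = chi2_contingency(observed)

# Expected count per cell under independence: row total * column total / grand total
expected_manual = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

print("degrees of freedom:", dof)
print("expected counts:")
print(np.round(expected, 1))
print("observed - expected:")
print(np.round(observed - expected, 1))
```

Large positive or negative differences mark the cells where the two groups deviate most from what independence between pack and iron level would predict.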