The Familiar team has provided us with some data on lifespans for subscribers to two different packages, the Vein Pack and the Artery Pack!
This data has been loaded for you as a dataframe named lifespans. Use the .head() method to print out the first five rows and take a look!

In [10]:
# Import libraries
import pandas as pd
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, chi2_contingency

# Load datasets
lifespans = pd.read_csv('familiar_lifespan.csv')
iron = pd.read_csv('familiar_iron.csv')
print(lifespans.head(),'\n')

     pack   lifespan
0    vein  76.255090
1  artery  76.404504
2  artery  75.952442
3  artery  76.923082
4  artery  73.771212 



The first thing we want to know is whether Familiar’s most basic package, the Vein Pack, actually has a significant impact on the subscribers. It would be a marketing goldmine if we can show that subscribers to the Vein Pack live longer than other people.
Extract the life spans of subscribers to the 'vein' pack and save the data into a variable called vein_pack_lifespans.
Next, use np.mean() to calculate the average lifespan for Vein Pack subscribers and print the result. Is it longer than 73 years?

In [11]:
# Select the lifespan values for rows where the pack is 'vein'
vein_pack_lifespans = lifespans.lifespan[lifespans.pack == 'vein']
f'The mean lifespan in the Vein Pack subscription group is {round(np.mean(vein_pack_lifespans))} years.'

'The mean lifespan in the Vein Pack subscription group is 76 years.'

We’d like to find out if the average lifespan of a Vein Pack subscriber is significantly different from the average life expectancy of 73 years.
Import the statistical test from scipy.stats that we would use to test the following null and alternative hypotheses:
Null: The average lifespan of a Vein Pack subscriber is 73 years.
Alternative: The average lifespan of a Vein Pack subscriber is NOT 73 years.

Now that you’ve imported the function you need, run the significance test and print out the p-value! Is the average lifespan of a Vein Pack subscriber significantly longer than 73 years? Use a significance threshold of 0.05.

In [12]:
sig_thres = 0.05
t_stat, pval = ttest_1samp(vein_pack_lifespans, 73)
print(f'Pval: {pval}\n')
if pval < sig_thres:
  print(f'''It is lower than the significance threshold, so the null hypothesis is rejected: there is a significant difference between
the sample mean and the hypothesized population mean of 73 years.\n''')

Pval: 5.972157921433211e-07

It is lower than the significance threshold, so the null hypothesis is rejected: there is a significant difference between
the sample mean and the hypothesized population mean of 73 years.
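
Note that ttest_1samp runs a two-sided test by default. If the question of interest is specifically whether the mean is greater than 73, a one-sided test is an option. The following is a minimal sketch, assuming SciPy 1.6 or later (where the alternative parameter is available); the sample values are made up for illustration and are not taken from familiar_lifespan.csv:

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical sample standing in for vein_pack_lifespans (made-up values)
sample = np.array([76.3, 77.4, 75.1, 78.9, 73.8, 77.1, 74.5, 78.2, 75.0, 76.6])

# Two-sided test (the default): is the mean different from 73?
t_two, p_two = ttest_1samp(sample, 73)

# One-sided test: is the mean greater than 73?
t_one, p_one = ttest_1samp(sample, 73, alternative='greater')

print(f'two-sided p-value: {p_two:.4g}')
print(f'one-sided p-value: {p_one:.4g}')
```

The t-statistic is the same in both calls. When the sample mean lies above 73, the one-sided p-value is half of the two-sided one.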



In order to differentiate Familiar’s different product lines, we’d like to compare this lifespan data between our different packages. Our next step up from the Vein Pack is the Artery Pack.
Let’s get the lifespans of Artery Pack subscribers. Using the same lifespans dataset, extract the lifespans of subscribers to the Artery Pack and save them as artery_pack_lifespans.
Use np.mean() to calculate the average lifespan for Artery Pack subscribers and print the result. Is it longer than for the Vein Pack?

In [13]:
# Select the lifespan values for rows where the pack is 'artery'
artery_pack_lifespans = lifespans.lifespan[lifespans.pack == 'artery']
f'The mean lifespan in the Artery Pack subscription group is {round(np.mean(artery_pack_lifespans))} years.'

'The mean lifespan in the Artery Pack subscription group is 75 years.'
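
As an alternative to filtering each pack separately, both group means can be computed in one call with groupby. This is a small sketch on a hypothetical dataframe named demo that only mimics the pack and lifespan columns; its values are made up:

```python
import pandas as pd

# Tiny hypothetical frame with the same columns as lifespans (made-up values)
demo = pd.DataFrame({
    'pack': ['vein', 'artery', 'artery', 'vein'],
    'lifespan': [76.0, 75.0, 73.0, 78.0],
})

# Mean lifespan per pack, indexed by pack name
means = demo.groupby('pack')['lifespan'].mean()
print(means)
```

The result is a Series, so a single group's mean can be read with means['vein'] or means['artery'].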

We’d like to find out if the average lifespan of a Vein Pack subscriber is significantly different from the average lifespan of an Artery Pack subscriber.
Import the statistical test from scipy.stats that we would use to test the following null and alternative hypotheses:
Null: The average lifespan of a Vein Pack subscriber is equal to the average lifespan of an Artery Pack subscriber.
Alternative: The average lifespan of a Vein Pack subscriber is NOT equal to the average lifespan of an Artery Pack subscriber.

Now that you’ve imported the function you need, run the significance test and print out the p-value! Is the average lifespan of a Vein Pack subscriber significantly different from the average lifespan of an Artery Pack subscriber? Use a significance threshold of 0.05.

In [14]:
t_stat_2t, pval_2t = ttest_ind(vein_pack_lifespans, artery_pack_lifespans)
print(f'Pval {pval_2t}\n')

if pval_2t < sig_thres:
  print('''It is lower than the significance threshold, so the null hypothesis is rejected.
There is a significant difference between the means of the 2 groups.''')
else:
  print('''It is higher than the significance threshold, so we fail to reject the null hypothesis.
There is no significant difference between the means of the 2 groups.\n''')

Pval 0.05588883079070819

It is higher than the significance threshold, so we fail to reject the null hypothesis.
There is no significant difference between the means of the 2 groups.
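
By default, ttest_ind assumes that both groups have the same population variance. When that assumption is doubtful, Welch's t-test is a common alternative, available through equal_var=False. The sketch below compares the two on made-up samples (group_a and group_b are hypothetical and are not the Familiar data):

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples with different sizes and spreads (made-up values)
group_a = np.array([76.1, 77.3, 75.2, 78.0, 76.6, 74.9])
group_b = np.array([74.0, 79.5, 71.2, 77.8, 73.1, 80.2, 72.4, 76.9])

# Student's t-test: pools the variance of both groups (the default)
t_pooled, p_pooled = ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print(f'pooled p-value: {p_pooled:.4g}')
print(f'Welch p-value:  {p_welch:.4g}')
```

With a p-value as close to the threshold as the one above, it can be worth checking whether the conclusion changes between the two variants.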



The Familiar team has provided us with another dataset containing survey data about iron counts for our subscribers. This data has been pre-processed to categorize iron counts as “low”, “normal”, and “high” for each subscriber. Familiar wants to be able to advise potential subscribers about possible side effects of these packs and whether they differ for the Vein vs. the Artery pack.
The data has been loaded for you as a dataframe named iron. Use the .head() method to print out the first five rows and take a look!

In [15]:
print(iron.head())

     pack    iron
0    vein     low
1  artery  normal
2  artery  normal
3  artery  normal
4  artery    high


Is there an association between the pack that a subscriber gets (Vein vs. Artery) and their iron level? Use the pandas crosstab() function to create a contingency table of the pack and iron columns in the iron data. Save the result as Xtab and print it out.

In [16]:
Xtab = pd.crosstab(iron.pack, iron.iron)
print(Xtab,'\n')

iron    high  low  normal
pack                     
artery    87   29      29
vein      20  140      40 
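
Raw counts are hard to compare when the groups differ in size (145 Artery vs. 200 Vein subscribers in the table above). One way to make the rows comparable is to pass normalize='index' to crosstab, which converts each row to proportions. The sketch below rebuilds a dataframe from the printed counts; the name demo is hypothetical:

```python
import pandas as pd

# Counts copied from the contingency table printed above
counts = {
    ('artery', 'high'): 87, ('artery', 'low'): 29, ('artery', 'normal'): 29,
    ('vein', 'high'): 20, ('vein', 'low'): 140, ('vein', 'normal'): 40,
}

# Expand the counts back into one row per subscriber
rows = [(pack, level) for (pack, level), n in counts.items() for _ in range(n)]
demo = pd.DataFrame(rows, columns=['pack', 'iron'])

# Share of each iron level within each pack (every row sums to 1)
props = pd.crosstab(demo['pack'], demo['iron'], normalize='index')
print(props.round(2))
```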



We’d like to find out if there is a significant association between which pack (Vein vs. Artery) someone subscribes to and their iron level.
Import the statistical test from scipy.stats that we would use to test the following null and alternative hypotheses:
Null: There is NOT an association between which pack (Vein vs. Artery) someone subscribes to and their iron level.
Alternative: There is an association between which pack (Vein vs. Artery) someone subscribes to and their iron level.
Now that you’ve imported the function you need, run the significance test and print out the p-value! Is there a significant association between which pack (Vein vs. Artery) someone subscribes to and their iron level? Use a significance threshold of 0.05.

In [17]:
stat, pval_ch, dof, expected = chi2_contingency(Xtab)
print(f'Pval: {pval_ch}\n')
if pval_ch < sig_thres:
  print('''It is lower than the significance threshold, so the null hypothesis is rejected.
There is a significant association between the variables.''')
else:
  print('''It is higher than the significance threshold, so we fail to reject the null hypothesis.
There is no significant association between the variables.\n''')

Pval: 9.359749337433008e-25

It is lower than the significance threshold, so the null hypothesis is rejected.
There is a significant association between the variables.
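
Besides the p-value, chi2_contingency also returns the degrees of freedom and the table of counts expected under the null hypothesis. A common rule of thumb is that the chi-square approximation is reasonable when every expected count is at least 5. The sketch below runs that check on the counts from the contingency table printed earlier; the variable name observed is an assumption for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table (rows: artery, vein; columns: high, low, normal)
observed = np.array([[87, 29, 29],
                     [20, 140, 40]])

chi2, pval, dof, expected = chi2_contingency(observed)

print(f'degrees of freedom: {dof}')
print('expected counts under the null hypothesis:')
print(np.round(expected, 1))
print('all expected counts >= 5:', bool((expected >= 5).all()))
```

Comparing observed with expected shows which cells drive the result; here the high and low columns deviate most from what independence would predict.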
