# Familiar: A Study In Data Analysis

Welcome to Familiar, a startup in the new market of blood transfusion! You’ve joined the team because you appreciate the flexible hours and extremely intelligent team, but the overeager doorman welcoming you into the office is a nice way to start your workday (well, work-evening).

Familiar has fallen on some tough times lately, so you’re hoping to help them draw some insights from their product data and move the needle (so to speak).

In [1]:
import pandas as pd
import numpy as np

### What Can Familiar Do For You? 

The Familiar team has provided us with some data on lifespans for subscribers to two different packages, the Vein Pack and the Artery Pack! 

In [2]:
lifespans = pd.read_csv('familiar_lifespan.csv')
lifespans.head()

Unnamed: 0,pack,lifespan
0,vein,76.25509
1,artery,76.404504
2,artery,75.952442
3,artery,76.923082
4,artery,73.771212


The first thing we want to know is whether Familiar’s most basic package, the Vein Pack, actually has a significant impact on the subscribers.

In [3]:
vein_pack_lifespans = lifespans.lifespan[lifespans['pack'] == 'vein']
vein_pack_lifespans.head(2)

0    76.255090
7    74.502021
Name: lifespan, dtype: float64

In [4]:
np.mean(vein_pack_lifespans)

76.16901335636044

We’d like to find out if the average lifespan of a Vein Pack subscriber is significantly different from the average life expectancy of 73 years.

We would use a one-sample t-test to test the following null and alternative hypotheses:

- Null: The average lifespan of a Vein Pack subscriber is 73 years.
- Alternative: The average lifespan of a Vein Pack subscriber is NOT 73 years.

In [6]:
# one sample t-test
from scipy.stats import ttest_1samp

tstat, pval = ttest_1samp(vein_pack_lifespans, 73)
print('p-value for lifespan of vein one-sample t-test : ', pval)

p-value for lifespan of vein one-sample t-test :  5.972157921433211e-07


(The p-value (about 0.0000006) is much smaller than 0.05, so we reject the null hypothesis and conclude that the average lifespan of a Vein Pack subscriber is significantly different from 73 years.)
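To see what the one-sample t-test measures, the t statistic can be computed by hand: it is the distance between the sample mean and the hypothesized mean, expressed in standard errors. The sketch below uses only the standard library and a small made-up sample (the values in `sample_lifespans` are hypothetical, not taken from `familiar_lifespan.csv`), so it illustrates the formula rather than reproducing the result above.

```python
import math
import statistics

# Hypothetical lifespans for illustration only, NOT the values in familiar_lifespan.csv
sample_lifespans = [76.3, 74.5, 77.1, 75.8, 76.9]
expected_mean = 73  # life expectancy assumed under the null hypothesis

n = len(sample_lifespans)
sample_mean = statistics.mean(sample_lifespans)
sample_sd = statistics.stdev(sample_lifespans)  # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(n)

# t statistic: how many standard errors the sample mean lies from 73
t_stat = (sample_mean - expected_mean) / standard_error
dof = n - 1  # degrees of freedom for a one-sample t-test
print('t statistic:', round(t_stat, 3), '| degrees of freedom:', dof)
```

A large absolute t statistic corresponds to a small p-value; `ttest_1samp` turns the statistic into a two-sided p-value using the t distribution with `n - 1` degrees of freedom.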

### Upselling Familiar: Pumping Life Into The Company 

In order to differentiate Familiar’s different product lines, we’d like to compare this lifespan data between our different packages. Our next step up from the Vein Pack is the Artery Pack.

In [7]:
artery_pack_lifespans = lifespans.lifespan[lifespans['pack'] == 'artery']
artery_pack_lifespans.head(2)

1    76.404504
2    75.952442
Name: lifespan, dtype: float64

In [8]:
np.mean(artery_pack_lifespans)

74.87366223517039

We’d like to find out if the average lifespan of a Vein Pack subscriber is significantly different from the average lifespan of an Artery Pack subscriber.

We would use a two-sample t-test to test the following null and alternative hypotheses:

- Null: The average lifespan of a Vein Pack subscriber is equal to the average lifespan of an Artery Pack subscriber.
- Alternative: The average lifespan of a Vein Pack subscriber is NOT equal to the average lifespan of an Artery Pack subscriber.

In [9]:
# two sample t-test
from scipy.stats import ttest_ind

tstat, pval = ttest_ind(vein_pack_lifespans, artery_pack_lifespans)
print('pvalue for lifespan two-sample t-test : ', pval)

pvalue for lifespan two-sample t-test :  0.05588883079070819


(The p-value (about 0.056) is larger than 0.05, so we fail to reject the null hypothesis: the average lifespan of a Vein Pack subscriber is not significantly different from the average lifespan of an Artery Pack subscriber.)
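By default, `ttest_ind` assumes both groups share the same variance (`equal_var=True`) and pools the two sample variances. A minimal sketch of that pooled statistic follows, using only the standard library and two made-up samples (`vein_sample` and `artery_sample` are hypothetical, not the notebook's data).

```python
import math
import statistics

# Hypothetical samples for illustration only, NOT the values in familiar_lifespan.csv
vein_sample = [76.3, 74.5, 77.1, 75.8, 76.9]
artery_sample = [76.4, 76.0, 76.9, 73.8, 72.6, 74.1]

n1, n2 = len(vein_sample), len(artery_sample)
var1 = statistics.variance(vein_sample)    # sample variance (n - 1 denominator)
var2 = statistics.variance(artery_sample)

# Pooled variance: each group's variance weighted by its degrees of freedom
dof = n1 + n2 - 2
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / dof
standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# t statistic: difference in sample means, measured in standard errors
t_stat = (statistics.mean(vein_sample) - statistics.mean(artery_sample)) / standard_error
print('t statistic:', round(t_stat, 3), '| degrees of freedom:', dof)
```

If the two packs cannot be assumed to have similar variances, passing `equal_var=False` to `ttest_ind` runs Welch's t-test instead, which does not pool the variances.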

### Side Effects: A Familiar Problem 

The Familiar team has provided us with another dataset containing survey data about iron counts for our subscribers. This data has been pre-processed to categorize iron counts as “low”, “normal”, and “high” for each subscriber. 

In [10]:
iron = pd.read_csv('familiar_iron.csv')
iron.head()

Unnamed: 0,pack,iron
0,vein,low
1,artery,normal
2,artery,normal
3,artery,normal
4,artery,high


In [11]:
xtab = pd.crosstab(iron['pack'], iron['iron'])
xtab

pack,high,low,normal
artery,87,29,29
vein,20,140,40


We’d like to find out if there is a significant association between which pack (Vein vs. Artery) someone subscribes to and their iron level.

We would use a chi-square test to test the following null and alternative hypotheses:

- Null: There is NOT an association between which pack (Vein vs. Artery) someone subscribes to and their iron level.
- Alternative: There is an association between which pack (Vein vs. Artery) someone subscribes to and their iron level.

In [12]:
# chi square test
from scipy.stats import chi2_contingency

chi2, pval, dof, expected = chi2_contingency(xtab)
print('p-value : ', pval)

p-value :  9.359749337433008e-25


(The p-value (about 9.36e-25) is far smaller than 0.05, so we reject the null hypothesis and conclude that there is a significant association between pack and iron level.)
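The chi-square result can be checked by hand from the contingency table. The sketch below copies the observed counts from the crosstab, computes the count expected in each cell if pack and iron level were independent, and sums the squared deviations. It relies on one shortcut that only holds here because the table has 2 degrees of freedom: in that case the chi-square survival function is simply `exp(-x / 2)`, so no statistics library is needed.

```python
import math

# Observed counts copied from the crosstab above (columns: high, low, normal)
observed = [
    [87, 29, 29],    # artery
    [20, 140, 40],   # vein
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_stat = sum(
    (obs_count - exp_count) ** 2 / exp_count
    for obs_row, exp_row in zip(observed, expected)
    for obs_count, exp_count in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# Shortcut valid only when dof == 2: survival function of chi-square is exp(-x / 2)
assert dof == 2
p_value = math.exp(-chi2_stat / 2)
print('chi-square:', chi2_stat, '| dof:', dof, '| p-value:', p_value)
```

The Artery Pack row shows far more "high" counts and far fewer "low" counts than independence would predict, and the Vein Pack row shows the reverse, which is what drives the statistic so high.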