### Question 1
Write a function called `proportion_of_education` which returns the proportion of children in the dataset whose mother's education level was less than high school (<12), high school (12), more than high school but not a college graduate (>12), or a college degree.

*This function should return a dictionary in the form of (use the correct numbers, do not round numbers):* 
```
    {"less than high school":0.2,
    "high school":0.4,
    "more than high school but not college":0.2,
    "college":0.2}
```

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns

In [2]:
df = pd.read_csv('NISPUF17.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,SEQNUMC,SEQNUMHH,PDAT,PROVWT_D,RDDWT_D,STRATUM,YEAR,AGECPOXR,HAD_CPOX,...,XVRCTY2,XVRCTY3,XVRCTY4,XVRCTY5,XVRCTY6,XVRCTY7,XVRCTY8,XVRCTY9,INS_STAT2_I,INS_BREAK_I
0,1,128521,12852,2,,235.916956,1031,2017,,2,...,,,,,,,,,,
1,2,10741,1074,2,,957.35384,1068,2017,,2,...,,,,,,,,,,
2,3,220011,22001,2,,189.611299,1050,2017,,2,...,,,,,,,,,,
3,4,86131,8613,1,675.430817,333.447418,1040,2017,,2,...,,,,,,,,,1.0,2.0
4,5,227141,22714,1,482.617748,278.768063,1008,2017,,2,...,,,,,,,,,2.0,1.0


In [4]:
print(df["EDUC1"].value_counts())
print(df["EDUC1"].value_counts().sum())
df["EDUC1"].count()

4    13656
3     6999
2     4906
1     2904
Name: EDUC1, dtype: int64
28465


28465

In [5]:
# Naive solution

def proportion_of_education():
    
    proportion_of_mEducation = dict()
    immun = pd.read_csv("NISPUF17.csv")
    total = immun["EDUC1"].count()
    levels = ["less than high school", "high school", "more than high school but not college", "college"]
    i = 1
    for level in levels:
        x = list(immun["EDUC1"].where(immun["EDUC1"] == i).value_counts() / total)
        proportion_of_mEducation[level] = x[0]
        i += 1
    
    return proportion_of_mEducation
    
    
proportion_of_education()

{'less than high school': 0.10202002459160373,
 'high school': 0.172352011241876,
 'more than high school but not college': 0.24588090637625154,
 'college': 0.47974705779026877}

In [6]:
small_df = df[["SEX", "EDUC1", "HAD_CPOX", "P_NUMFLU", "CBF_01", "P_NUMVRC"]]
small_df.head()

Unnamed: 0,SEX,EDUC1,HAD_CPOX,P_NUMFLU,CBF_01,P_NUMVRC
0,1,4,2,,1,
1,1,3,2,,2,
2,2,3,2,,2,
3,2,4,2,3.0,2,1.0
4,2,1,2,0.0,1,0.0


In [14]:
ratio_df = small_df.groupby('EDUC1').count().copy()
ratio_df

Unnamed: 0_level_0,SEX,HAD_CPOX,P_NUMFLU,CBF_01,P_NUMVRC
EDUC1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,2904,2904,1639,2904,1639
2,4906,4906,2503,4906,2503
3,6999,6999,3639,6999,3639
4,13656,13656,7552,13656,7552


In [18]:
ratio_df['SEX'].sum()

28465

In [20]:
proportion_of_mEducation = dict()
levels = ["less than high school", "high school", "more than high school but not college", "college"]
ratio_df['ratio'] = ratio_df['SEX']/ratio_df['SEX'].sum()
for i in range(1,5):
    proportion_of_mEducation[levels[i-1]] = ratio_df.loc[i, 'ratio']
    
proportion_of_mEducation

{'less than high school': 0.10202002459160373,
 'high school': 0.172352011241876,
 'more than high school but not college': 0.24588090637625154,
 'college': 0.47974705779026877}

In [21]:
# better solution

def proportion_of_education():
    
    proportion_of_mEducation = dict()
    immun = pd.read_csv("NISPUF17.csv")
    # Select from the frame loaded inside the function, not the global `df`
    small_df = immun[["SEX", "EDUC1", "HAD_CPOX", "P_NUMFLU", "CBF_01", "P_NUMVRC"]]
    levels = ["less than high school", "high school", "more than high school but not college", "college"]
    ratio_df = small_df.groupby('EDUC1').count().copy()
    ratio_df['ratio'] = ratio_df['SEX']/ratio_df['SEX'].sum()
    for i in range(1,5):
        proportion_of_mEducation[levels[i-1]] = ratio_df.loc[i, 'ratio']

    
    return proportion_of_mEducation
    
    
proportion_of_education()

{'less than high school': 0.10202002459160373,
 'high school': 0.172352011241876,
 'more than high school but not college': 0.24588090637625154,
 'college': 0.47974705779026877}
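
The same proportions can probably be obtained more directly with `value_counts(normalize=True)`, which divides each count by the number of non-null values. The sketch below is only an illustration: the `toy` frame is made-up data standing in for `NISPUF17.csv`, and `LEVELS` and `proportion_from_codes` are hypothetical names, not part of the assignment.

```python
import pandas as pd

# Made-up stand-in for the EDUC1 column (codes 1-4); not real survey data.
toy = pd.DataFrame({"EDUC1": [1, 2, 2, 3, 4, 4, 4, 4, 3, 2]})

# Hypothetical mapping from EDUC1 code to the label the question asks for.
LEVELS = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}


def proportion_from_codes(codes):
    # normalize=True returns shares of the non-null values instead of raw counts.
    shares = codes.value_counts(normalize=True)
    # .get(code, 0.0) keeps a level in the result even if no row has that code.
    return {label: float(shares.get(code, 0.0)) for code, label in LEVELS.items()}


result = proportion_from_codes(toy["EDUC1"])
print(result)
```

On the real data the call would be `proportion_from_codes(immun["EDUC1"])`, assuming the column only holds the codes 1 to 4 as the `value_counts()` output above suggests.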

### Question 2

Let's explore the relationship between being fed breastmilk as a child and getting a seasonal influenza vaccine from a healthcare provider. Return a tuple of the average number of influenza vaccines for those children we know received breastmilk as a child and those we know did not.

*This function should return a tuple in the form (use the correct numbers):*
```
(2.5, 0.1)
```

In [22]:
def average_influenza_doses():
    
    df = pd.read_csv("NISPUF17.csv")
    fed_milk = df["P_NUMFLU"].where(df["CBF_01"] == 1).dropna()
    not_fed_milk = df["P_NUMFLU"].where(df["CBF_01"] == 2).dropna()
    return (fed_milk.mean(), not_fed_milk.mean())
print(average_influenza_doses())

(1.8799187420058687, 1.5963945918878317)
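
An alternative to the two `where(...).dropna()` calls is a single `groupby`, since `mean()` skips missing values by default. This is a minimal sketch on made-up data; the `toy` frame and the function name `average_doses_by_breastfeeding` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows: CBF_01 is 1 (breastfed), 2 (not breastfed) or another code (unknown).
toy = pd.DataFrame({
    "CBF_01": [1, 1, 1, 2, 2, 99, 1],
    "P_NUMFLU": [2.0, 3.0, np.nan, 1.0, 2.0, 4.0, 1.0],
})


def average_doses_by_breastfeeding(frame):
    # One mean per CBF_01 code; NaN dose counts are ignored by mean().
    means = frame.groupby("CBF_01")["P_NUMFLU"].mean()
    # Only the known "yes" (1) and "no" (2) groups are returned.
    return (float(means.loc[1]), float(means.loc[2]))


averages = average_doses_by_breastfeeding(toy)
print(averages)
```

Rows with any other `CBF_01` code still get their own group but are simply not read back, which mirrors the filtering in the function above.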


### Question 3
It would be interesting to see if there is any evidence of a link between vaccine effectiveness and sex of the child. Calculate the ratio of the number of children who contracted chickenpox but were vaccinated against it (at least one varicella dose) versus those who were vaccinated but did not contract chickenpox. Return results by sex.

*This function should return a dictionary in the form of (use the correct numbers):* 
```
    {"male":0.2,
    "female":0.4}
```

In [24]:
def chickenpox_by_sex():
    
    df = pd.read_csv("NISPUF17.csv")
    ratio_by_sex = dict()
    small_df = df[["SEX", "HAD_CPOX", "P_NUMVRC"]]
    vaccinated = small_df.dropna()
    vaccinated = vaccinated[vaccinated["P_NUMVRC"] != 0.0]
    female_cpox = vaccinated[(vaccinated["HAD_CPOX"] == 1) & (vaccinated["SEX"] == 2)] 
    male_cpox = vaccinated[(vaccinated["HAD_CPOX"] == 1) & (vaccinated["SEX"] == 1)]
    male_no_cpox = vaccinated[(vaccinated["HAD_CPOX"] == 2) & (vaccinated["SEX"] == 1)]
    female_no_cpox = vaccinated[(vaccinated["HAD_CPOX"] == 2) & (vaccinated["SEX"] == 2)]
    # Ratio per sex: vaccinated children who had chickenpox / vaccinated children who did not
    ratio_by_sex["male"] = len(male_cpox) / len(male_no_cpox)
    ratio_by_sex["female"] = len(female_cpox) / len(female_no_cpox)
    return ratio_by_sex

print(chickenpox_by_sex())

{'male': 0.009675583380762664, 'female': 0.0077918259335489565}


In [25]:
small_df.head()

Unnamed: 0,SEX,EDUC1,HAD_CPOX,P_NUMFLU,CBF_01,P_NUMVRC
0,1,4,2,,1,
1,1,3,2,,2,
2,2,3,2,,2,
3,2,4,2,3.0,2,1.0
4,2,1,2,0.0,1,0.0


In [30]:
small_df['HAD_CPOX'].value_counts()

2     27955
1       402
77      105
99        3
Name: HAD_CPOX, dtype: int64

In [32]:
multi_index_table = small_df.dropna().where(small_df['P_NUMVRC'] != 0).groupby(['SEX', 'HAD_CPOX']).count()
multi_index_table

Unnamed: 0_level_0,Unnamed: 1_level_0,EDUC1,P_NUMFLU,CBF_01,P_NUMVRC
SEX,HAD_CPOX,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1.0,1.0,68,68,68,68
1.0,2.0,7028,7028,7028,7028
1.0,77.0,22,22,22,22
2.0,1.0,53,53,53,53
2.0,2.0,6802,6802,6802,6802
2.0,77.0,22,22,22,22


In [34]:
multi_index_table.loc[1]

Unnamed: 0_level_0,EDUC1,P_NUMFLU,CBF_01,P_NUMVRC
HAD_CPOX,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1.0,68,68,68,68
2.0,7028,7028,7028,7028
77.0,22,22,22,22


In [36]:
multi_index_table.loc[1, 1]

EDUC1       68
P_NUMFLU    68
CBF_01      68
P_NUMVRC    68
Name: (1.0, 1.0), dtype: int64

In [39]:
multi_index_table.loc[(1, 1), 'EDUC1'] / multi_index_table.loc[(1, 2), 'EDUC1']

0.009675583380762664

In [40]:
# similarly
multi_index_table.loc[(2, 1), 'EDUC1'] / multi_index_table.loc[(2, 2), 'EDUC1']

0.0077918259335489565
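
The multi-index lookups above can also be expressed with `pd.crosstab`, which builds the sex by chickenpox count table in one call. The following is a sketch under stated assumptions: `toy` is made-up data, and `chickenpox_ratio_by_sex` is a hypothetical helper, not the graded function.

```python
import pandas as pd

# Made-up rows. SEX: 1 = male, 2 = female. HAD_CPOX: 1 = yes, 2 = no, 77 = don't know.
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2],
    "HAD_CPOX": [1, 2, 2, 2, 2, 1, 2, 2, 77, 2, 1, 2],
    "P_NUMVRC": [1, 1, 2, 1, 2, 2, 1, 1, 1, 0, 0, None],
})


def chickenpox_ratio_by_sex(frame):
    # Keep children with at least one varicella dose; NaN > 0 is False, so
    # rows with a missing dose count are dropped as well.
    vaccinated = frame[frame["P_NUMVRC"] > 0]
    # Rows are SEX codes, columns are HAD_CPOX codes, cells are row counts.
    counts = pd.crosstab(vaccinated["SEX"], vaccinated["HAD_CPOX"])
    # Had chickenpox (1) divided by did not have chickenpox (2), per sex.
    ratio = counts[1] / counts[2]
    return {"male": float(ratio.loc[1]), "female": float(ratio.loc[2])}


ratios = chickenpox_ratio_by_sex(toy)
print(ratios)
```

The `77` column stays in the crosstab but never enters the ratio, just as the `77.0` rows in `multi_index_table` are not used above.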

### Question 4
A correlation is a statistical relationship between two variables. If we wanted to know if vaccines work, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease [1]. In this question, you are to see if there is a correlation between having had the chicken pox and the number of chickenpox vaccine doses given (varicella).

Some notes on interpreting the answer. The `had_chickenpox_column` is either `1` (for yes) or `2` (for no), and the `num_chickenpox_vaccine_column` is the number of doses a child has been given of the varicella vaccine. A positive correlation (e.g., `corr > 0`) means that an increase in `had_chickenpox_column` (which means more no’s) would also increase the values of `num_chickenpox_vaccine_column` (which means more doses of vaccine). If there is a negative correlation (e.g., `corr < 0`), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.

Also, `pval` is the probability of observing, by chance alone, a correlation between `had_chickenpox_column` and `num_chickenpox_vaccine_column` at least as strong as the one measured. A small `pval` means that the observed correlation is highly unlikely to have occurred by chance. In this case, `pval` should be very small (will end in `e-18` indicating a very small number).

[1] This isn’t really the full picture, since we are not looking at when the dose was given. It’s possible that children had chickenpox and then their parents went to get them the vaccine. Does this dataset have the data we would need to investigate the timing of the dose?

In [2]:
def corr_chickenpox():
    
    df = pd.read_csv("NISPUF17.csv")
    small_df = df[["HAD_CPOX", "P_NUMVRC"]].dropna()
    small_df = small_df[(small_df["HAD_CPOX"] ==1) | (small_df["HAD_CPOX"] ==2)]
    small_df.columns = ["had_chickenpox_column", "num_chickenpox_vaccine_column"]
    small_df.sort_index(inplace=True)
    # here is some stub code to actually run the correlation
    corr, pval=stats.pearsonr(small_df["had_chickenpox_column"],small_df["num_chickenpox_vaccine_column"])
    
    # just return the correlation
    return corr

corr_chickenpox()

0.07044873460148
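
Because `HAD_CPOX` codes "yes" as `1` and "no" as `2`, the sign of the correlation is easy to misread. The sketch below uses made-up data (the `toy` frame and the `had_flag` recoding are assumptions for illustration) to show that recoding the column so that a larger value means "had chickenpox" flips the sign of `corr` without changing its magnitude.

```python
import pandas as pd
import scipy.stats as stats

# Made-up rows: 1 = had chickenpox, 2 = did not; doses are invented.
toy = pd.DataFrame({
    "had_chickenpox_column": [1, 1, 2, 2, 2, 2, 1, 2],
    "num_chickenpox_vaccine_column": [0, 1, 2, 1, 2, 3, 0, 2],
})

# Original coding: a larger value means "no chickenpox".
corr, pval = stats.pearsonr(
    toy["had_chickenpox_column"], toy["num_chickenpox_vaccine_column"]
)

# Hypothetical recoding: 1 = had chickenpox, 0 = did not.
had_flag = (toy["had_chickenpox_column"] == 1).astype(int)
corr_flag, pval_flag = stats.pearsonr(had_flag, toy["num_chickenpox_vaccine_column"])

print(corr, corr_flag)
```

In this toy data the children without chickenpox have more doses, so `corr` is positive under the original coding and `corr_flag` is its negative.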