# Assignment 2
For this assignment you'll be looking at 2017 data on immunizations from the CDC. Your datafile for this assignment is in [assets/NISPUF17.csv](assets/NISPUF17.csv). A data user's guide, which you'll need in order to map the variables in the data to the questions being asked, is available at [assets/NIS-PUF17-DUG.pdf](assets/NIS-PUF17-DUG.pdf). **Note: you may have to go to your Jupyter tree (click on the Coursera image) and navigate to the assignment 2 assets folder to see this PDF file.**

## Question 1
Write a function called `proportion_of_education` which returns the proportion of children in the dataset whose mother had each of the following education levels: less than high school (<12), high school (12), more than high school but not a college graduate (>12), and college degree.

*This function should return a dictionary in the form of (use the correct numbers, do not round numbers):* 
```
    {"less than high school":0.2,
    "high school":0.4,
    "more than high school but not college":0.2,
    "college":0.2}
```


In [46]:
# Goal: return proportion of children that had a mother with education
# lvls less than grade 12, at grade 12, over grade 12 but not college
# degree, and over grade 12 with college degree

# EDUC1 – education of the mother
# 1 is <12 years
# 2 is 12 years
# 3 is >12 years, not college graduate
# 4 is College graduate

def proportion_of_education():
    
    import pandas as pd
    
    # Open the file and take a look:
    df = pd.read_csv('assets/NISPUF17.csv')
    #print(df.head())
    
    # Isolate EDUC1
    education = df['EDUC1']
    #print(education.head())
    #print(len(education))
    
    # Plan is to make masks and use where() and dropna() functions
    # to get number of mothers of each type
    mask1 = df['EDUC1'] == 1
    mask2 = df['EDUC1'] == 2
    mask3 = df['EDUC1'] == 3
    mask4 = df['EDUC1'] == 4
    #print(mask1)
    
    # Test out the mask
    # print(education.where(mask1).dropna().head())
    #print(len(education.where(mask1).dropna()))
    
    solution = {
        # ratio = masked and dropped length / total length
        "less than high school":  len(education.where(mask1).dropna()) / len(education),
        "high school": len(education.where(mask2).dropna()) / len(education),
        "more than high school but not college": len(education.where(mask3).dropna()) / len(education),
        "college": len(education.where(mask4).dropna()) / len(education)
    }
    
    return solution
    
#proportion_of_education()

In [47]:
assert type(proportion_of_education())==type({}), "You must return a dictionary."
assert len(proportion_of_education()) == 4, "You have not returned a dictionary with four items in it."
assert "less than high school" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "high school" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "more than high school but not college" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "college" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
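
As a cross-check on the mask-based approach above, the same proportions can be sketched with `value_counts(normalize=True)`. The data below is a small hypothetical stand-in for the `EDUC1` column, not the real file, and the names `toy_educ`, `labels`, and `toy_result` are made up for illustration. One caveat: `value_counts` ignores missing values by default, so if `EDUC1` contained any NaN values the denominator would differ from `len(education)` used in the solution.

```python
import pandas as pd

# Hypothetical toy data standing in for the EDUC1 column (codes 1-4).
toy_educ = pd.Series([1, 2, 2, 3, 4, 4, 4, 2], name="EDUC1")

# Mapping from EDUC1 codes to the dictionary keys the assignment asks for.
labels = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}

# normalize=True gives each code's share of the non-missing rows.
shares = toy_educ.value_counts(normalize=True)

# Build the result dict; a code that never appears gets a share of 0.0.
toy_result = {labels[code]: float(shares.get(code, 0.0)) for code in labels}
print(toy_result)
```

On the toy data, three of the eight rows have code 2, so the "high school" share is 3/8.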


## Question 2

Let's explore the relationship between being fed breastmilk as a child and getting a seasonal influenza vaccine from a healthcare provider. Return a tuple of the average number of influenza vaccines for those children we know received breastmilk as a child and those we know did not.

*This function should return a tuple in the form (use the correct numbers):*
```
(2.5, 0.1)
```

In [48]:
# Goal: return avg num of influenza vaccines for kids that got 
# breastmilk vs avg num of those that didn't, as a tuple

# CBF_01 – whether the child was fed breastmilk
# Values present are 1, 2, 77, and 99
# Assume 1 is Yes
# Assume 2 is No
# PDF says 77 is Don't know
# PDF says 99 is Refused response
# Doesn't seem to have any NA vals

# P_NUMFLU is the total num of seasonal flu vaccine doses
# that each kid got
# ranges from 0-6 and has NA vals

# Strategy:
# We just care about CBF_01 and P_NUMFLU
# For CBF_01, 77 and 99 vals don't help us
# For P_NUMFLU, NA vals don't help us

# 0) Isolate CBF_01 and P_NUMFLU cols with the index
# 1) Remove rows where CBF_01 = 77 or 99
# 2) Remove rows where P_NUMFLU = NA
# 3) Isolate kids that got breastmilk (CBF_01 = 1)
# 4) Find avg num of flu vaccines they got
# 5) Isolate kids that didn't get breastmilk
# 6) Find avg num of flu vaccines they got
# 7) Stick the avg nums into a soln tuple

def average_influenza_doses():
    
    import pandas as pd
    
    # Open the file and take a look:
    df = pd.read_csv('assets/NISPUF17.csv')
    #print(df.head())
    
    #0) Isolate CBF_01 and P_NUMFLU cols with the index
    cols = ['CBF_01', 'P_NUMFLU']
    newDF = df[cols]
    #print(newDF.head(20)) #P_NUMFLU's NA vals are NaN
    
    #1) Remove rows where CBF_01 = 77 or 99
    newDF = newDF[(newDF['CBF_01'] == 1) |
                  (newDF['CBF_01'] == 2)]
    #print(newDF.head(20))
    
    #2) Remove rows where P_NUMFLU = NaN
    newDF = newDF.dropna()
    #print(newDF.head(20))
    
    #3) Isolate kids that got breastmilk (CBF_01 = 1)
    BM_DF = newDF[newDF['CBF_01'] == 1]
    #print(BM_DF.head(20))
    
    #4) Find avg num of flu vaccines they got
    avg_BM_DF = BM_DF['P_NUMFLU'].mean()
    #print(avg_BM_DF)
    
    #5) Isolate kids that didn't get breastmilk
    noBM_DF = newDF[newDF['CBF_01'] == 2]
    #print(noBM_DF.head(20))
    
    #6) Find avg num of flu vaccines they got
    avg_noBM_DF = noBM_DF['P_NUMFLU'].mean()
    #print(avg_noBM_DF)
    
    #7) Stick the avg nums into a soln tuple
    solution = (avg_BM_DF, avg_noBM_DF)
    #print(solution)
    return solution
    
    #raise NotImplementedError()
    
#average_influenza_doses()

In [49]:
assert len(average_influenza_doses())==2, "Return two values in a tuple, the first for yes and the second for no."
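
The filter-then-average steps above can also be expressed with a single `groupby`. This is only a sketch on a hypothetical toy frame (the values are invented, and `toy`, `known`, `means`, and `toy_result` are illustrative names); it relies on `mean()` skipping NaN by default, which plays the role of the explicit `dropna()` step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: CBF_01 uses 1 = yes, 2 = no, 77/99 = unknown/refused.
toy = pd.DataFrame({
    "CBF_01": [1, 1, 2, 2, 77, 1, 99, 2],
    "P_NUMFLU": [2.0, 3.0, 1.0, np.nan, 4.0, np.nan, 0.0, 2.0],
})

# Keep only rows where the breastmilk answer is a known yes or no.
known = toy[toy["CBF_01"].isin([1, 2])]

# Group by the answer and average the doses; mean() ignores NaN doses.
means = known.groupby("CBF_01")["P_NUMFLU"].mean()

# Order the tuple as (breastfed, not breastfed).
toy_result = (float(means.loc[1]), float(means.loc[2]))
print(toy_result)
```

On the toy frame, the breastfed group averages the doses 2 and 3, and the other group averages 1 and 2.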


## Question 3
It would be interesting to see if there is any evidence of a link between vaccine effectiveness and sex of the child. Calculate the ratio of the number of children who contracted chickenpox but were vaccinated against it (at least one varicella dose) versus those who were vaccinated but did not contract chicken pox. Return results by sex. 

*This function should return a dictionary in the form of (use the correct numbers):* 
```
    {"male":0.2,
    "female":0.4}
```

Note: To aid in verification, the `chickenpox_by_sex()['female']` value the autograder is looking for starts with the digits `0.0077`.

In [50]:
# Goal: return ratio of kids that were vaccinated with at
# least one varicella dose but still got chicken pox to
# those that were vaccinated and didn't get chicken pox,
# broken down by gender, as a dictionary

# P_NUMVRC is num of varicella doses a child received
# ranges from 0-3 and has NA vals

# HAD_CPOX is whether or not a child had chicken pox
# 1 = Yes
# 2 = No
# 77 = Don't know
# 99 = Refused
# NA(?) = Missing, didn't see any missing vals

# SEX is gender
# 1 = Male
# 2 = Female

# Strategy:
# We care about gender, vaccination status, and
# whether or not the kids got chicken pox
# P_NUMVRC, HAD_CPOX, and SEX.
# For P_NUMVRC, 0 and NA vals don't help us bc
# we're looking for kids that got vaccinated
# For HAD_CPOX, 77 and 99 vals don't help us
# No problem with SEX vals. no NA vals there

# 0) Isolate HAD_CPOX, P_NUMVRC, and SEX cols
#    with the index
# 1) Remove rows where P_NUMVRC = 0 or NA
# 2) Remove rows where HAD_CPOX = 77 or 99
# 3) Separate the boys
# 4) num boys that got cpox / those that didn't
# 5) Separate the girls
# 6) num girls that got cpox / those that didn't
# 7) Throw the nums into a soln dictionary
#    and return it

def chickenpox_by_sex():
    
    import pandas as pd
    
    # Open the file and take a look:
    df = pd.read_csv('assets/NISPUF17.csv')
    #print(df.head())
    
    #0) Isolate HAD_CPOX, P_NUMVRC, and SEX cols
    #   with the index
    cols = ['HAD_CPOX', 'P_NUMVRC', 'SEX']
    newDF = df[cols]
    #print(newDF.head(20))
    
    # 1) Remove rows where P_NUMVRC = 0 or NA
    newDF = newDF[newDF['P_NUMVRC'] != 0].dropna()
    #print(newDF.head(20))
        
    # 2) Remove rows where HAD_CPOX = 77 or 99
    newDF = newDF[(newDF['HAD_CPOX'] == 1) |
                  (newDF['HAD_CPOX'] == 2)]
    #print(newDF.head(60))
    
    # 3) Separate the boys
    boysDF = newDF[newDF['SEX'] == 1]
    #print(boysDF.head())
    
    # 4) num boys that got cpox / those that didn't
    male = len(boysDF[boysDF['HAD_CPOX'] == 1].index) / len(boysDF[boysDF['HAD_CPOX'] == 2].index)
    #print(male)
    
    
    # 5) Separate the girls
    girlsDF = newDF[newDF['SEX'] == 2]
    #print(girlsDF.head())
    
    
    # 6) num girls that got cpox / those that didn't
    female = len(girlsDF[girlsDF['HAD_CPOX'] == 1].index) / len(girlsDF[girlsDF['HAD_CPOX'] == 2].index)
    #print(female)
    
    
    # 7) Throw the nums into a soln dictionary
    #    and return it
    solution = {'male': male,
               'female': female}
    #print(solution)
    return solution    
    
    #raise NotImplementedError()

#chickenpox_by_sex()

{'male': 0.009675583380762664, 'female': 0.0077918259335489565}

In [51]:
assert len(chickenpox_by_sex())==2, "Return a dictionary with two items, the first for males and the second for females."
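
The per-sex counting above can be condensed with `pd.crosstab`, which tabulates `SEX` against `HAD_CPOX` in one call. The sketch below runs on a hypothetical toy frame with invented values, and `toy`, `vaccinated`, `counts`, `ratios`, and `toy_result` are illustrative names. It assumes every sex has at least one vaccinated child in each of the yes and no groups; otherwise a column would be missing or a division by zero would occur.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: SEX 1 = male, 2 = female; HAD_CPOX 1 = yes, 2 = no.
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 2, 2, 2, 2, 2, 1],
    "HAD_CPOX": [1, 2, 2, 2, 1, 2, 2, 77, 2, 1],
    "P_NUMVRC": [1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, np.nan, 0.0],
})

# Keep children with at least one dose and a known chickenpox answer.
# A NaN dose count compares as False, so those rows are dropped as well.
vaccinated = toy[(toy["P_NUMVRC"] >= 1) & toy["HAD_CPOX"].isin([1, 2])]

# Rows are SEX values, columns are HAD_CPOX values, cells are counts.
counts = pd.crosstab(vaccinated["SEX"], vaccinated["HAD_CPOX"])

# Ratio of vaccinated children who had chickenpox to those who did not.
ratios = counts[1] / counts[2]
toy_result = {"male": float(ratios.loc[1]), "female": float(ratios.loc[2])}
print(toy_result)
```

On the toy frame, one vaccinated boy had chickenpox against three who did not, and one vaccinated girl had it against two who did not.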


## Question 4
A correlation is a statistical relationship between two variables. If we wanted to know if vaccines work, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease [1]. In this question, you are to see if there is a correlation between having had the chicken pox and the number of chickenpox vaccine doses given (varicella).

Some notes on interpreting the answer. The `had_chickenpox_column` is either `1` (for yes) or `2` (for no), and the `num_chickenpox_vaccine_column` is the number of doses a child has been given of the varicella vaccine. A positive correlation (e.g., `corr > 0`) means that an increase in `had_chickenpox_column` (which means more no’s) would also increase the values of `num_chickenpox_vaccine_column` (which means more doses of vaccine). If there is a negative correlation (e.g., `corr < 0`), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.

Also, `pval` is the probability of observing, purely by chance, a correlation between `had_chickenpox_column` and `num_chickenpox_vaccine_column` at least as strong as the one measured. A small `pval` means that the observed correlation is highly unlikely to have occurred by chance. In this case, `pval` should be very small (it will end in `e-18`, indicating a very small number).
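
To make the sign convention concrete, here is a minimal sketch on hypothetical toy data (the values are invented and are not drawn from the dataset). The children coded `2` (no chickenpox) have more doses on average than those coded `1`, so the two columns rise together and `stats.pearsonr` should report a positive `corr`.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: 1 = had chickenpox, 2 = did not.
toy = pd.DataFrame({
    "had_chickenpox_column": [1, 1, 1, 2, 2, 2, 2, 2],
    "num_chickenpox_vaccine_column": [0, 0, 1, 1, 2, 2, 2, 3],
})

# pearsonr returns the correlation coefficient and its two-sided p-value.
corr, pval = stats.pearsonr(
    toy["had_chickenpox_column"],
    toy["num_chickenpox_vaccine_column"],
)

# corr > 0 here: higher codes (more "no" answers) go with more doses.
print(corr, pval)
```

Swapping the dose values between the two groups would flip the sign, which corresponds to the negative-correlation case described above.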

[1] This isn’t really the full picture, since we are not looking at when the dose was given. It’s possible that children had chickenpox and then their parents went to get them the vaccine. Does this dataset have the data we would need to investigate the timing of the dose?

In [67]:
# Goal: find if there's a correlation between
# HAD_CPOX and P_NUMVRC

# 0) Isolate HAD_CPOX and P_NUMVRC cols
#    with the index
# 1) Remove rows where P_NUMVRC = NA
# 2) Remove rows where HAD_CPOX = 77 or 99
# 3) Run the correlation on the 2 cols
# 4) Return the correlation

def corr_chickenpox():
    import scipy.stats as stats
    import numpy as np
    import pandas as pd
    
    # this is just an example dataframe
    #df=pd.DataFrame({"had_chickenpox_column":np.random.randint(1,3,size=(100)),
    #               "num_chickenpox_vaccine_column":np.random.randint(0,6,size=(100))})

    
    
    # here is some stub code to actually run the correlation
    #corr, pval=stats.pearsonr(df["had_chickenpox_column"],df["num_chickenpox_vaccine_column"])
    
    # just return the correlation
    #return corr

    
    
    # Open the file and take a look:
    df = pd.read_csv('assets/NISPUF17.csv')
    #print(df.head())
    
    #0) Isolate HAD_CPOX and P_NUMVRC cols
    #   with the index
    cols = ['HAD_CPOX', 'P_NUMVRC']
    newDF = df[cols]
    #print(newDF.head(20))
    
    # 1) Remove rows where P_NUMVRC = NA
    newDF = newDF.dropna()
    #print(newDF.head(20))
        
    # 2) Remove rows where HAD_CPOX = 77 or 99
    newDF = newDF[(newDF['HAD_CPOX'] == 1) |
                  (newDF['HAD_CPOX'] == 2)]
    #print(newDF.head(60))
    
    # 3) Run the correlation on the 2 cols
    corr, pval=stats.pearsonr(newDF['HAD_CPOX'],newDF['P_NUMVRC'])
    #print(corr)
    
    # 4) Return the correlation
    return corr
    
    #raise NotImplementedError()

#corr_chickenpox()

In [68]:
assert -1<=corr_chickenpox()<=1, "You must return a float number between -1.0 and 1.0."
