# Assignment 2

For this assignment you'll be looking at 2017 data on immunizations from the CDC. Your datafile for this assignment is in [assets/NISPUF17.csv](assets/NISPUF17.csv). A data user's guide, which you'll need to map the variables in the data to the questions being asked, is available at [assets/NIS-PUF17-DUG.pdf](assets/NIS-PUF17-DUG.pdf). **Note: you may have to go to your Jupyter tree (click on the Coursera image) and navigate to the assignment 2 assets folder to see this PDF file.**

## Question 1
Write a function called `proportion_of_education` which returns the proportion of children in the dataset who had a mother with the education levels equal to less than high school (<12), high school (12), more than high school but not a college graduate (>12) and college degree.

*This function should return a dictionary in the form of (use the correct numbers, do not round numbers):* 
```
    {"less than high school":0.2,
    "high school":0.4,
    "more than high school but not college":0.2,
    "college":0.2}
```

In [None]:
def proportion_of_education():
    # YOUR CODE HERE
    
    ## Open the csv file with pd.read_csv and store the result in a variable.
    ## That imports every column; for the sake of efficiency we could keep only the relevant columns each time.
    
    ## Which columns do we need? Only the column for the mother's education level.
    ## With that column in hand, we can use 4 boolean masks and/or df.dropna() to isolate the individual children
    ## who fall into each of the 4 categories.
    ## From there we only need to count the rows in each of the 4 categories to compute the proportions.
    
    import pandas as pd
    
    ## Import the .csv file into a DataFrame; only the EDUC1 column is used below
    df = pd.read_csv('assets/NISPUF17.csv')
    ## EDUC1 is the name of the column that describes the mother's education level
    ## EDUC1=1,2,3,4 with the four possible values referring to the four categories of education in order
    
    numKids = len(df.index)
    ## Create 4 boolean masks to count the 4 different education levels appearing in the dataset
    level1 = df[df["EDUC1"]==1]
    level2 = df[df["EDUC1"]==2]
    level3 = df[df["EDUC1"]==3]
    level4 = df[df["EDUC1"]==4]
    
    ## Store the final results in here in a dictionary
    proportions = {"less than high school":len(level1)/numKids,
    "high school":len(level2)/numKids,
    "more than high school but not college":len(level3)/numKids,
    "college":len(level4)/numKids}
    
    ##print(proportions)
    return proportions
   ## raise NotImplementedError()

In [None]:
assert type(proportion_of_education())==type({}), "You must return a dictionary."
assert len(proportion_of_education()) == 4, "You have not returned a dictionary with four items in it."
assert "less than high school" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "high school" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "more than high school but not college" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "college" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
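
The mask-and-count steps above can also be sketched with `value_counts(normalize=True)`. This is only an illustration on a made-up `EDUC1` column, not the CDC data, and the helper name `education_proportions` is hypothetical. Note that `normalize=True` divides by the number of non-missing rows, so it matches the `len(df.index)` denominator used above only under the assumption that `EDUC1` has no missing values.

```python
import pandas as pd

# Toy stand-in for the EDUC1 column (made-up values, not the CDC data).
toy = pd.DataFrame({"EDUC1": [1, 2, 2, 3, 4, 4, 4, 2]})

# Map each EDUC1 code to the dictionary key the assignment asks for.
LABELS = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}

def education_proportions(frame):
    # Hypothetical helper: share of rows per education code.
    # normalize=True divides each count by the number of non-missing rows.
    shares = frame["EDUC1"].value_counts(normalize=True)
    return {label: float(shares.get(code, 0.0)) for code, label in LABELS.items()}

result = education_proportions(toy)
print(result)
```

A single pass over the column replaces the four separate masks, at the cost of hiding the denominator choice inside `normalize=True`.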


## Question 2

Let's explore the relationship between being fed breastmilk as a child and getting a seasonal influenza vaccine from a healthcare provider. Return a tuple of the average number of influenza vaccines for those children we know received breastmilk as a child and those we know did not.

*This function should return a tuple in the form (use the correct numbers):*
```
(2.5, 0.1)
```

In [None]:
def average_influenza_doses():
    # YOUR CODE HERE
    import pandas as pd
    
    df = pd.read_csv('assets/NISPUF17.csv')
    
    ## We first need to determine which variables describe whether or not a child was breastfed & their number of influenza vaccines
    ## Split df into two pieces, one for kids who were breastfed and one for those who were not
    ## Then compute the average value along the column for number of influenza vaccines using builtin dataframe functions
    
    ## CBF_01 describes the breastfeeding status of children. The set of values is "Yes", "No", "Missing", "Don't Know", represented by
    ## 1, 2, 77, and 99 respectively (I assume based on the ordering in the guide)
    
    ## P_NUMFLU describes the total number of influenza vaccinations that the child received. Its values are unambiguous.
    
    breastfed = df[df['CBF_01']==1]
    notBreastfed = df[df['CBF_01']==2]
    
    breastfed_mean = breastfed.loc[:,'P_NUMFLU'].mean()
    notBreastfed_mean = notBreastfed.loc[:,'P_NUMFLU'].mean()
    ## Recall: df.loc[:,'COL'].mean() returns the mean of the column 'COL' in the dataframe df as a scalar
    
    results = (breastfed_mean,notBreastfed_mean)
    ##print(results)
    return results
    
   ## raise NotImplementedError()

In [None]:
assert len(average_influenza_doses())==2, "Return two values in a tuple, the first for yes and the second for no."
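
The split-then-average steps above can be sketched in one `groupby` call. The frame below is made-up toy data that reuses the `CBF_01` and `P_NUMFLU` column names, and the code `99` stands in for any value other than yes (`1`) or no (`2`); it is an illustration, not the graded solution.

```python
import pandas as pd

# Toy stand-in for the two columns used above (made-up values).
toy = pd.DataFrame({
    "CBF_01":   [1, 1, 2, 2, 99, 1],
    "P_NUMFLU": [2.0, 3.0, 1.0, None, 4.0, 1.0],
})

# Keep only rows where breastfeeding status is a known yes (1) or no (2).
known = toy[toy["CBF_01"].isin([1, 2])]

# Mean dose count per group; mean() skips missing P_NUMFLU values.
means = known.groupby("CBF_01")["P_NUMFLU"].mean()

result = (float(means.loc[1]), float(means.loc[2]))
print(result)
```

Because `mean()` ignores missing values by default, rows without a dose count drop out of the average rather than counting as zero.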


## Question 3
It would be interesting to see if there is any evidence of a link between vaccine effectiveness and sex of the child. Calculate the ratio of the number of children who were vaccinated against chickenpox (at least one varicella dose) but still contracted it versus those who were vaccinated and did not contract chickenpox. Return results by sex.

*This function should return a dictionary in the form of (use the correct numbers):* 
```
    {"male":0.2,
    "female":0.4}
```

Note: To aid in verification, the `chickenpox_by_sex()['female']` value the autograder is looking for starts with the digits `0.0077`.

In [29]:
def chickenpox_by_sex():
    # YOUR CODE HERE
    import pandas as pd
    
    df = pd.read_csv('assets/NISPUF17.csv')
    
    ## Apply a boolean mask to reduce the dataframe to only those children who got chickenpox
    ## (Note that we could also throw away columns we wouldn't need to solve this problem for the sake of efficiency if this was a real
    ## world problem where speed and computational efficiency matters.)
    ## HAD_CPOX is the columnname in df for this variable and 1 corresponds to yes according to the data set guide.
    
    ## Need to first filter out vaccinated children.
    ## Then split into males and females
    ## Then compute the number of vaccinated males and females resp. who did contract chickenpox and divide by the number of 
    ## vaccinated males and females resp. who were vaccinated and didn't contract it.
    
    vaxxedChildren = df[(df['P_NUMVRC']>0) & (df['P_NUMVRC'].notna())] ## 'P_NUMVRC' is number of chickenpox vaccinations 
    males = vaxxedChildren[vaxxedChildren['SEX']==1]
    females = vaxxedChildren[vaxxedChildren['SEX']==2]
    ## There are children for whom we do not know whether they had chickenpox; their values for 'HAD_CPOX' are >2. Throw away those entries.
    males = males[males['HAD_CPOX']<3]
    females = females[females['HAD_CPOX']<3]
    ## We now have dataframes for the vaccinated males and females respectively. Next we count the number of kids who still got chickenpox
    ## despite getting a chickenpox vaccination.
    males_got_sick = males[males['HAD_CPOX']==1]
    females_got_sick = females[females['HAD_CPOX']==1]
    num_sick_males = len(males_got_sick)
    num_sick_females = len(females_got_sick)
                           
    ## Now count the number of kids who never got sick
    ## No need to go back to the original males and females dataframes since all kids either got chickenpox or they did not.
    num_healthy_males = len(males) - num_sick_males
    num_healthy_females = len(females) - num_sick_females
                           
    maleRatio = num_sick_males/num_healthy_males
    femaleRatio = num_sick_females/num_healthy_females
    
    ratios = {'male':maleRatio, 'female':femaleRatio}
    ##print(ratios)
    return ratios                        
    ##raise NotImplementedError()

In [30]:
assert len(chickenpox_by_sex())==2, "Return a dictionary with two items, the first for males and the second for females."


{'male': 0.009675583380762664, 'female': 0.0077918259335489565}
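
The filter, split, and count steps above can be sketched with `pd.crosstab`, which tabulates sex against chickenpox status in one call. The rows below are made-up toy data that reuse the `SEX`, `HAD_CPOX`, and `P_NUMVRC` column names (with `77` standing in for an unknown status); the numbers have nothing to do with the real dataset.

```python
import pandas as pd

# Toy stand-in for the three columns used above (made-up values).
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    "HAD_CPOX": [1, 2, 2, 77, 1, 2, 2, 2, 2, 1, 2],
    "P_NUMVRC": [1, 1, 2, 1, 1, 2, 1, 1, 2, 0, None],
})

# Vaccinated children (at least one dose) with a known yes/no chickenpox status.
# A missing P_NUMVRC compares as False against > 0, so those rows drop out too.
vaccinated = toy[(toy["P_NUMVRC"] > 0) & toy["HAD_CPOX"].isin([1, 2])]

# Rows are SEX (1 = male, 2 = female); columns are HAD_CPOX (1 = yes, 2 = no).
counts = pd.crosstab(vaccinated["SEX"], vaccinated["HAD_CPOX"])

# Ratio of vaccinated-and-sick to vaccinated-and-healthy, per sex.
ratios = counts[1] / counts[2]
result = {"male": float(ratios.loc[1]), "female": float(ratios.loc[2])}
print(result)
```

The crosstab keeps both counts visible side by side, which makes it easier to check the denominator than subtracting lengths.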


## Question 4
A correlation is a statistical relationship between two variables. If we wanted to know if vaccines work, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease [1]. In this question, you are to see if there is a correlation between having had chickenpox and the number of chickenpox vaccine doses given (varicella).

Some notes on interpreting the answer. The `had_chickenpox_column` is either `1` (for yes) or `2` (for no), and the `num_chickenpox_vaccine_column` is the number of doses a child has been given of the varicella vaccine. A positive correlation (e.g., `corr > 0`) means that an increase in `had_chickenpox_column` (which means more no’s) would also increase the values of `num_chickenpox_vaccine_column` (which means more doses of vaccine). If there is a negative correlation (e.g., `corr < 0`), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.

Also, `pval` is the probability of observing, purely by chance, a correlation between `had_chickenpox_column` and `num_chickenpox_vaccine_column` at least as strong as the one measured. A small `pval` means that the observed correlation is highly unlikely to occur by chance. In this case, `pval` should be very small (will end in `e-18` indicating a very small number).

[1] This isn’t really the full picture, since we are not looking at when the dose was given. It’s possible that children had chickenpox and then their parents went to get them the vaccine. Does this dataset have the data we would need to investigate the timing of the dose?

In [23]:
def corr_chickenpox():
    import scipy.stats as stats
    import numpy as np
    import pandas as pd
    
    # this is just an example dataframe
    ##df=pd.DataFrame({"had_chickenpox_column":np.random.randint(1,3,size=(100)),
    ##               "num_chickenpox_vaccine_column":np.random.randint(0,6,size=(100))})

    # here is some stub code to actually run the correlation
    ##corr, pval=stats.pearsonr(df["had_chickenpox_column"],df["num_chickenpox_vaccine_column"])
    
    # just return the correlation
    #return corr

    # YOUR CODE HERE
    
    ## Mimic the example that we were given above
    ## The important lesson here is to make sure that we clean our data sufficiently well before we begin to do any sort of computations
    ## or analysis with it.
    
    df = pd.read_csv('assets/NISPUF17.csv')
    df = df[df['P_NUMVRC'].notna()] ## Keep only the rows that have an actual value for the number of chickenpox vaccinations
    df = df[df['HAD_CPOX']<3] ## Throw away unknown or missing values for chickenpox status
    corr, pval=stats.pearsonr(df['HAD_CPOX'],df['P_NUMVRC'])
    ##print(corr)
    ##print(pval)
    return corr
    
    ##raise NotImplementedError()

In [24]:
assert -1<=corr_chickenpox()<=1, "You must return a float number between -1.0 and 1.0."


0.07044873460147986
2.7780263182916748e-18
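
The sign convention described in the notes above is easy to misread, because `HAD_CPOX` codes yes as `1` and no as `2`. The sketch below uses a few made-up values (not the CDC data) to show that recoding the column as a 0/1 "had chickenpox" flag flips the sign of the Pearson coefficient while leaving its magnitude unchanged. The variable names are illustrative only.

```python
import numpy as np
import scipy.stats as stats

# Made-up values: 1 = had chickenpox, 2 = did not (same coding as HAD_CPOX).
had_cpox = np.array([1, 1, 2, 2, 2, 2])
doses = np.array([0, 1, 1, 2, 2, 2])

# Correlation with the raw 1/2 coding: positive means "more no's, more doses".
corr_raw, pval_raw = stats.pearsonr(had_cpox, doses)

# Recode as a 0/1 flag where 1 means the child had chickenpox.
had_flag = (had_cpox == 1).astype(int)
corr_flag, pval_flag = stats.pearsonr(had_flag, doses)

print(corr_raw, corr_flag)
```

Since the flag is a decreasing linear function of the original coding (`2 - had_cpox`), the two coefficients are exact negatives of each other, so a positive `corr` under the 1/2 coding corresponds to fewer chickenpox cases among children with more doses.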
