# Assignment 2

For this assignment you'll be looking at 2017 data on immunizations from the CDC. Your data file for this assignment is in [assets/NISPUF17.csv](assets/NISPUF17.csv). A data user's guide, which you'll need to map the variables in the data to the questions being asked, is available at [assets/NIS-PUF17-DUG.pdf](assets/NIS-PUF17-DUG.pdf). **Note: you may have to go to your Jupyter tree (click on the Coursera image) and navigate to the assignment 2 assets folder to see this PDF file.**

## Question 1
Write a function called `proportion_of_education` which returns the proportion of children in the dataset whose mother had each of the following education levels: less than high school (<12), high school (12), more than high school but not a college graduate (>12), and college degree.

*This function should return a dictionary in the form of (use the correct numbers, do not round numbers):* 
```
    {"less than high school":0.2,
    "high school":0.4,
    "more than high school but not college":0.2,
    "college":0.2}
```
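
One possible way to build a dictionary of this shape is sketched below on a small made-up table rather than the real data file. It assumes, as the solution cell that follows does, that the `EDUC1` column uses codes 1 to 4 for the four education levels; the toy values and the `labels` mapping are illustrative only.

```python
import pandas as pd

# Toy stand-in for the EDUC1 column; these eight values are invented for
# illustration and are not taken from NISPUF17.csv.
toy = pd.DataFrame({"EDUC1": [1, 2, 2, 3, 4, 4, 4, 2]})

# Assumed mapping from EDUC1 codes to the dictionary keys the question asks for.
labels = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}

# value_counts(normalize=True) gives each code's share of the non-missing rows.
shares = toy["EDUC1"].value_counts(normalize=True)

# Look up each code, falling back to 0.0 if a code never appears.
proportions = {label: float(shares.get(code, 0.0)) for code, label in labels.items()}
print(proportions)
```

Note that `value_counts` ignores missing values by default, so if a column contained `NaN` entries the denominator would differ from dividing by the total number of rows.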

In [3]:
import pandas as pd
def proportion_of_education():
    # Read the CSV file into a DataFrame
    df = pd.read_csv('assets/NISPUF17.csv')

    # Select the rows for each education level (EDUC1 codes 1-4)
    df_less_than_high_school = df[df['EDUC1'] == 1]
    df_high_school = df[df['EDUC1'] == 2]
    df_more_than_high_school_less_than_college = df[df['EDUC1'] == 3]
    df_college = df[df['EDUC1'] == 4]

    # Total number of rows in the dataset
    total_records = df.shape[0]

    #Calculate all proportions
    df_college_proportion = df_college.shape[0] / total_records
    df_high_school_proportion = df_high_school.shape[0] / total_records
    df_more_than_high_school_less_than_college_proportion = df_more_than_high_school_less_than_college.shape[0] / total_records
    df_less_than_high_school_proportion = df_less_than_high_school.shape[0] / total_records

    #Create dictionary
    proportion_dict = {
        "less than high school":df_less_than_high_school_proportion,
        "high school": df_high_school_proportion,
        "more than high school but not college": df_more_than_high_school_less_than_college_proportion,
        "college": df_college_proportion
    }

    return proportion_dict
    
    

In [4]:
assert type(proportion_of_education())==type({}), "You must return a dictionary."
assert len(proportion_of_education()) == 4, "You have not returned a dictionary with four items in it."
assert "less than high school" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "high school" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "more than high school but not college" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."
assert "college" in proportion_of_education().keys(), "You have not returned a dictionary with the correct keys."


## Question 2

Let's explore the relationship between being fed breastmilk as a child and getting a seasonal influenza vaccine from a healthcare provider. Return a tuple of the average number of influenza vaccines for those children we know received breastmilk as a child and those we know did not.

*This function should return a tuple in the form (use the correct numbers):*
```
(2.5, 0.1)
```
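
The averaging step can be sketched on a small made-up table. The example assumes, as the solution cell below does, that `CBF_01` is `1` for children known to have been breastfed and `2` for those known not to have been, and that `P_NUMFLU` holds the dose count; the code `77` stands in for any other response and is a hypothetical placeholder.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; not taken from NISPUF17.csv.
toy = pd.DataFrame({
    "CBF_01":   [1,   1,   1,      2,   2,   77,  1],
    "P_NUMFLU": [2.0, 3.0, np.nan, 1.0, 0.0, 4.0, 1.0],
})

# Series.mean() skips NaN by default, so rows with an unknown dose count
# do not contribute to either the numerator or the denominator.
breastfed = toy.loc[toy["CBF_01"] == 1, "P_NUMFLU"].mean()
not_breastfed = toy.loc[toy["CBF_01"] == 2, "P_NUMFLU"].mean()

result = (float(breastfed), float(not_breastfed))
print(result)
```

Because the filter keeps only codes `1` and `2`, the row coded `77` is excluded from both averages.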

In [1]:
import pandas as pd
def average_influenza_doses():
    #Read the CSV file into a dataframe variable
    df = pd.read_csv('assets/NISPUF17.csv')
    
    # Select the P_NUMFLU column for children who were (1) and were not (2) breastfed
    new_df_did_breast_feed = df[df['CBF_01'] == 1][['P_NUMFLU']]
    new_df_did_not_breast_feed = df[df['CBF_01'] == 2][['P_NUMFLU']]
    
    # Drop 'NA' values from the DataFrames
    new_df_did_breast_feed = new_df_did_breast_feed.dropna(subset=['P_NUMFLU'])
    new_df_did_not_breast_feed = new_df_did_not_breast_feed.dropna(subset=['P_NUMFLU'])

    #Size of data 
    did_size = new_df_did_breast_feed.shape[0]
    did_not_size = new_df_did_not_breast_feed.shape[0]

    #Take the average of both values
    did_breast_feed_average = new_df_did_breast_feed['P_NUMFLU'].sum() / did_size
    did_not_breast_feed_average = new_df_did_not_breast_feed['P_NUMFLU'].sum() / did_not_size

    return did_breast_feed_average, did_not_breast_feed_average

In [2]:
assert len(average_influenza_doses())==2, "Return two values in a tuple, the first for yes and the second for no."


## Question 3
It would be interesting to see if there is any evidence of a link between vaccine effectiveness and sex of the child. Calculate the ratio of the number of children who contracted chickenpox but were vaccinated against it (at least one varicella dose) versus those who were vaccinated but did not contract chickenpox. Return results by sex.

*This function should return a dictionary in the form of (use the correct numbers):* 
```
    {"male":0.2,
    "female":0.4}
```

Note: To aid in verification, the `chickenpox_by_sex()['female']` value the autograder is looking for starts with the digits `0.0077`.
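
To make the ratio concrete, here is a minimal sketch on made-up rows. It assumes the same coding the solution cell below uses (`SEX`: 1 male, 2 female; `HAD_CPOX`: 1 yes, 2 no; `P_NUMVRC`: number of varicella doses); the helper `ratio_for` is a hypothetical name introduced for this example.

```python
import pandas as pd

# Invented rows for illustration only; not taken from NISPUF17.csv.
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "HAD_CPOX": [1, 2, 2, 2, 1, 2, 2, 1, 2],
    "P_NUMVRC": [1, 1, 2, 0, 1, 1, 2, 0, 1],
})

# Only children with at least one varicella dose enter the ratio.
vaccinated = toy[toy["P_NUMVRC"] >= 1]

def ratio_for(sex_code):
    """Vaccinated-and-contracted count divided by vaccinated-and-not-contracted count."""
    group = vaccinated[vaccinated["SEX"] == sex_code]
    contracted = int((group["HAD_CPOX"] == 1).sum())
    not_contracted = int((group["HAD_CPOX"] == 2).sum())
    return contracted / not_contracted

ratios = {"male": ratio_for(1), "female": ratio_for(2)}
print(ratios)
```

In this toy table one vaccinated male contracted chickenpox and two did not, and one vaccinated female contracted it and three did not.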

In [11]:
import pandas as pd
def chickenpox_by_sex():
    #Read the CSV file into a dataframe variable
    df = pd.read_csv('assets/NISPUF17.csv')
    
    # Create variable for ease
    had_pox = 'HAD_CPOX'  # 1-yes, 2-no
    sex = 'SEX'  # 1-male, 2-female
    vaccine = 'P_NUMVRC'  # 0, did not have, >= 1 did have

    # Create separate DataFrames for males and females
    df_males = df[df[sex] == 1]
    df_females = df[df[sex] == 2]

    # Vaccinated males, split by whether they contracted chickenpox
    males_contracted_vaccinated = df_males[(df_males[had_pox] == 1) & (df_males[vaccine] >= 1)]
    males_not_contracted_vaccinated = df_males[(df_males[had_pox] == 2) & (df_males[vaccine] >= 1)]

    # Vaccinated females, split by whether they contracted chickenpox
    females_contracted_vaccinated = df_females[(df_females[had_pox] == 1) & (df_females[vaccine] >= 1)]
    females_not_contracted_vaccinated = df_females[(df_females[had_pox] == 2) & (df_females[vaccine] >= 1)]

    # Calculate the ratios
    females_ratio = females_contracted_vaccinated.shape[0] / females_not_contracted_vaccinated.shape[0]
    males_ratio = males_contracted_vaccinated.shape[0] / males_not_contracted_vaccinated.shape[0]

    final_dict = {
        "male": males_ratio,
        "female": females_ratio
    }

    return final_dict

In [12]:
assert len(chickenpox_by_sex())==2, "Return a dictionary with two items, the first for males and the second for females."


## Question 4
A correlation is a statistical relationship between two variables. If we wanted to know if vaccines work, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease [1]. In this question, you are to see if there is a correlation between having had the chicken pox and the number of chickenpox vaccine doses given (varicella).

Some notes on interpreting the answer: the `had_chickenpox_column` is either `1` (for yes) or `2` (for no), and the `num_chickenpox_vaccine_column` is the number of doses of the varicella vaccine a child has been given. A positive correlation (i.e., `corr > 0`) means that an increase in `had_chickenpox_column` (which means more no’s) goes along with an increase in `num_chickenpox_vaccine_column` (which means more doses of vaccine). A negative correlation (i.e., `corr < 0`) indicates that having had chickenpox is related to an increase in the number of vaccine doses.

Also, `pval` is the probability of observing, by chance alone, a correlation between `had_chickenpox_column` and `num_chickenpox_vaccine_column` at least as large in magnitude as the one measured, if the two variables were in fact uncorrelated. A small `pval` means that the observed correlation is highly unlikely to have occurred by chance. In this case, `pval` should be very small (it will end in `e-18`, indicating a very small number).

[1] This isn’t really the full picture, since we are not looking at when the dose was given. It’s possible that children had chickenpox and then their parents went to get them the vaccine. Does this dataset have the data we would need to investigate the timing of the dose?
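
The sign convention described above can be checked on a tiny made-up sample. This is only a sketch: the eight rows are invented, and the column names simply reuse the labels from the notes. Children coded `2` (no chickenpox) are given more doses on average here, so the correlation should come out positive.

```python
import pandas as pd
from scipy import stats

# Invented rows for illustration only; not taken from NISPUF17.csv.
toy = pd.DataFrame({
    "had_chickenpox_column":         [1, 1, 1, 2, 2, 2, 2, 2],
    "num_chickenpox_vaccine_column": [0, 1, 0, 1, 2, 2, 1, 2],
})

# pearsonr returns the correlation coefficient and a two-sided p-value.
corr, pval = stats.pearsonr(
    toy["had_chickenpox_column"],
    toy["num_chickenpox_vaccine_column"],
)
corr, pval = float(corr), float(pval)

# A positive sign means higher codes (more "no" answers) go with more doses.
print(f"corr = {corr:.3f}, pval = {pval:.3g}")
```

With so few rows the p-value is not meaningful in itself; the point is only the direction of the sign.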

In [17]:
def corr_chickenpox():
    import scipy.stats as stats
    import pandas as pd

    # Load the data
    df = pd.read_csv('assets/NISPUF17.csv')

    # Keep only rows where HAD_CPOX is 1 (yes) or 2 (no), dropping codes such as 77 and 99
    valid_had_chickenpox = df[df['HAD_CPOX'].isin([1, 2])]

    # Drop rows with NaNs from both columns
    valid_had_chickenpox = valid_had_chickenpox.dropna(subset=['HAD_CPOX', 'P_NUMVRC'])

    # Create variables for the stats function
    had_chickenpox = valid_had_chickenpox['HAD_CPOX']
    num_chickenpox_vaccine = valid_had_chickenpox['P_NUMVRC']

    # Run the correlation on values
    corr, pval = stats.pearsonr(had_chickenpox, num_chickenpox_vaccine)

    # Return only the correlation coefficient (pval is not returned)
    return corr


In [18]:
assert -1<=corr_chickenpox()<=1, "You must return a float number between -1.0 and 1.0."
