## For this assignment we'll be looking at 2017 data on immunizations from the CDC. The data file for this assignment is [NISPUF17.csv](NISPUF17.csv). A data user's guide is available at [NIS-PUF17-DUG.pdf](NIS-PUF17-DUG.pdf).

In [1]:
# importing the necessary libraries
import pandas as pd
import scipy.stats as stats

In [2]:
# read the file into a DataFrame and show the first 5 rows
df = pd.read_csv('NISPUF17.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,SEQNUMC,SEQNUMHH,PDAT,PROVWT_D,RDDWT_D,STRATUM,YEAR,AGECPOXR,HAD_CPOX,...,XVRCTY2,XVRCTY3,XVRCTY4,XVRCTY5,XVRCTY6,XVRCTY7,XVRCTY8,XVRCTY9,INS_STAT2_I,INS_BREAK_I
0,1,128521,12852,2,,235.916956,1031,2017,,2,...,,,,,,,,,,
1,2,10741,1074,2,,957.35384,1068,2017,,2,...,,,,,,,,,,
2,3,220011,22001,2,,189.611299,1050,2017,,2,...,,,,,,,,,,
3,4,86131,8613,1,675.430817,333.447418,1040,2017,,2,...,,,,,,,,,1.0,2.0
4,5,227141,22714,1,482.617748,278.768063,1008,2017,,2,...,,,,,,,,,2.0,1.0


## Write a function called `proportion_of_education` which returns the proportion of children in the dataset whose mother's education level was less than high school (<12), high school (12), more than high school but not a college graduate (>12), or college degree.

### *This function should return a dictionary in the form of :* 
 ```
     {"less than high school":0.2,
     "high school":0.4,
     "more than high school but not college":0.2,
     "college":0.2}
 ``` 

In [5]:
def proportion_of_education(df):
    # separate columns with respective education levels
    edu = df['EDUC1']
    edu1 = df[df['EDUC1'] == 1]   # less than high school
    edu2 = df[df['EDUC1'] == 2]   # high school
    edu3 = df[df['EDUC1'] == 3]   # more than high school but not college
    edu4 = df[df['EDUC1'] == 4]   # college
    
    # calculate the proportion 
    ed1=len(edu1)/len(edu)
    ed2=len(edu2)/len(edu)
    ed3=len(edu3)/len(edu)
    ed4=len(edu4)/len(edu)
    
    return {'less than high school': ed1, 'high school': ed2,
            'more than high school but not college': ed3, 'college': ed4 }

In [6]:
proportion_of_education(df)

{'less than high school': 0.10202002459160373,
 'high school': 0.172352011241876,
 'more than high school but not college': 0.24588090637625154,
 'college': 0.47974705779026877}
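
The same proportions can also be obtained with `value_counts(normalize=True)`. The sketch below assumes, as in the function above, that `EDUC1` uses the codes 1 to 4 in that order. The name `proportion_of_education_vc` and the small `toy` frame are made up for illustration and do not come from NISPUF17.csv.

```python
import pandas as pd

# Labels for the EDUC1 codes, in the order used by the function above.
EDUC_LABELS = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}

def proportion_of_education_vc(frame):
    # Hypothetical alternative: share of rows for each EDUC1 code.
    shares = frame["EDUC1"].value_counts(normalize=True)
    return {label: float(shares.get(code, 0.0))
            for code, label in EDUC_LABELS.items()}

# Toy data (not from NISPUF17.csv), used only to show the call.
toy = pd.DataFrame({"EDUC1": [1, 2, 2, 3, 4, 4, 4, 4]})
result = proportion_of_education_vc(toy)
```

Note that `value_counts` skips missing values by default, whereas dividing by `len(df['EDUC1'])` counts them in the denominator, so the two approaches agree only when `EDUC1` has no missing entries.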

## Let's explore the relationship between being fed breastmilk as a child and getting a seasonal influenza vaccine from a healthcare provider. 

In [7]:
bre = df[df['CBF_01'] == 1]  # fed breastmilk
nbre = df[df['CBF_01'] == 2]  # not fed breastmilk

In [8]:
yes = bre['P_NUMFLU'].mean()   # average number of flu vaccine doses for the breastfed group
no = nbre['P_NUMFLU'].mean()   # average number of flu vaccine doses for the non-breastfed group

In [27]:
# Return a tuple of the average number of influenza vaccines for those children we know received
# breastmilk as a child and those we know did not.
(yes, no)

(1.8799187420058687, 1.5963945918878317)
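
The two filters and means above can also be written as a single `groupby`. This is a minimal sketch on made-up data: the helper name `average_flu_doses` is hypothetical, and the code `99` in the toy frame merely stands in for any `CBF_01` value other than 1 or 2.

```python
import pandas as pd

def average_flu_doses(frame):
    # Keep only children with a known answer: 1 = breastfed, 2 = not breastfed.
    known = frame[frame["CBF_01"].isin([1, 2])]
    # Mean number of flu doses per group; missing dose counts are skipped.
    means = known.groupby("CBF_01")["P_NUMFLU"].mean()
    return (float(means.loc[1]), float(means.loc[2]))

# Toy data (not from NISPUF17.csv); 99 is a placeholder for "unknown".
toy = pd.DataFrame({
    "CBF_01":   [1, 1, 1, 2, 2, 99],
    "P_NUMFLU": [2.0, 3.0, None, 1.0, 2.0, 5.0],
})
flu_means = average_flu_doses(toy)
```

Like `Series.mean()` in the cells above, the grouped mean ignores rows where `P_NUMFLU` is missing.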

## It would be interesting to see if there is any evidence of a link between vaccine effectiveness and sex of the child.  

In [10]:
# separate males and females
male = df[df['SEX'] ==1]
female = df[df['SEX'] ==2]

In [11]:
# got at least one varicella dose
male_vax = male[male['P_NUMVRC'] >= 1]
female_vax = female[female['P_NUMVRC'] >= 1]

# Calculate the ratio of the number of vaccinated children who contracted chickenpox
# to the number of vaccinated children who did not contract chickenpox.
ratio_m = len(male_vax[male_vax['HAD_CPOX'] ==1])/len(male_vax[male_vax['HAD_CPOX'] ==2])
ratio_f = len(female_vax[female_vax['HAD_CPOX'] ==1])/len(female_vax[female_vax['HAD_CPOX'] ==2])

In [12]:
# Return results by sex.
{'male': ratio_m, 'female': ratio_f}

{'male': 0.009675583380762664, 'female': 0.0077918259335489565}

In [13]:
# calculate the same ratio for unvaccinated males
male_unvax = male[male['P_NUMVRC'] ==0]
ratio = len(male_unvax[male_unvax['HAD_CPOX'] ==1 ]) / len(male_unvax[male_unvax['HAD_CPOX'] == 2 ])

In [14]:
# print the ratio for males
ratio

0.043219076005961254
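
Because the same ratio is computed several times (males, females, vaccinated, unvaccinated), it may help to wrap it in one function. The sketch below assumes the same codes used above (`SEX`: 1 = male, 2 = female; `HAD_CPOX`: 1 = yes, 2 = no). The helper `cpox_ratio` and the toy frame are hypothetical and not part of the assignment data.

```python
import pandas as pd

def cpox_ratio(frame, vaccinated=True):
    # Hypothetical helper: (had chickenpox) / (did not have chickenpox)
    # within one vaccination group.
    if vaccinated:
        group = frame[frame["P_NUMVRC"] >= 1]   # at least one varicella dose
    else:
        group = frame[frame["P_NUMVRC"] == 0]   # no varicella dose
    had = (group["HAD_CPOX"] == 1).sum()
    had_not = (group["HAD_CPOX"] == 2).sum()
    return had / had_not

# Toy data (not from NISPUF17.csv), used only to show the call.
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "P_NUMVRC": [1, 2, 1, 0, 1, 1, 1, 1, 0],
    "HAD_CPOX": [1, 2, 2, 1, 1, 2, 2, 2, 2],
})
by_sex = {
    "male": cpox_ratio(toy[toy["SEX"] == 1]),
    "female": cpox_ratio(toy[toy["SEX"] == 2]),
}
```

The helper does not guard against a group in which nobody avoided chickenpox; in that case the denominator is zero and the result is not a meaningful ratio.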

## Is there a correlation between having had the chicken pox and the number of chickenpox vaccine doses given (varicella)?

### Some notes on interpreting the answer. The `had_chickenpox_column` is either `1` (for yes) or `2` (for no), and the `num_chickenpox_vaccine_column` is the number of doses a child has been given of the varicella vaccine.
### A positive correlation (e.g., `corr > 0`) means that higher values of `had_chickenpox_column` (that is, more no's) tend to go with higher values of `num_chickenpox_vaccine_column` (more doses of vaccine). A negative correlation (e.g., `corr < 0`) indicates that having had chickenpox is associated with a higher number of vaccine doses.
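
To see how the 1/2 coding affects the sign, here is a small sketch on made-up values (not taken from the dataset). It also recodes the column as a hypothetical 0/1 flag, `had_flag`, where 1 means the child had chickenpox, to show that the sign flips while the strength stays the same.

```python
import scipy.stats as stats

# Toy values: 1 = had chickenpox, 2 = did not have chickenpox.
had_cpox = [1, 1, 2, 2, 2, 2]
doses = [0, 1, 1, 2, 2, 2]

# With the 1/2 coding, "no" is the larger value, so more doses among
# children without chickenpox shows up as a positive correlation.
corr_toy, _ = stats.pearsonr(had_cpox, doses)

# Recoding to 1 = had chickenpox, 0 = did not reverses the direction
# of the variable, so the correlation changes sign.
had_flag = [1 if value == 1 else 0 for value in had_cpox]
corr_flag, _ = stats.pearsonr(had_flag, doses)
```

Here `corr_toy` is positive and `corr_flag` is its negative, which is why the direction of the coding has to be kept in mind when reading the result.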

In [15]:
pox_data = df[['HAD_CPOX','P_NUMVRC']]
pox_data = pox_data[pox_data['HAD_CPOX'] <=2]
final_data = pox_data.dropna()

### Also, `pval` is the probability of observing, by chance alone, a correlation between `had_chickenpox_column` and `num_chickenpox_vaccine_column` at least as large in magnitude as the one computed, if the two columns were in fact unrelated.
### A small `pval` means that the observed correlation is highly unlikely to occur by chance. In this case,  `pval` should be very small (will end in `e-18` indicating a very small number).
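
One way to build intuition for `pval` is a permutation test: shuffle one column to break any real link, recompute the correlation, and count how often chance alone does at least as well. This is only an illustrative sketch on made-up data; it is not how `stats.pearsonr` computes its p-value by default, and the names `n_shuffles` and `perm_pval` are hypothetical.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Toy data, unrelated to NISPUF17.csv.
x = np.array([1, 1, 1, 2, 2, 2, 2, 2, 2, 2])
y = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

observed, _ = stats.pearsonr(x, y)

# Shuffle y many times and count how often the shuffled correlation
# is at least as strong (in magnitude) as the observed one.
n_shuffles = 2000
hits = 0
for i in range(n_shuffles):
    r_shuffled, _p = stats.pearsonr(x, rng.permutation(y))
    if abs(r_shuffled) >= abs(observed):
        hits += 1

perm_pval = hits / n_shuffles  # share of shuffles matching or beating the data
```

A small `perm_pval` means that shuffled data rarely reproduces a correlation as strong as the observed one, which is the same idea the `pval` returned by `stats.pearsonr` expresses.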

In [16]:
corr, pval=stats.pearsonr(final_data["HAD_CPOX"], final_data["P_NUMVRC"])

In [17]:
corr, pval

(0.07044873460148046, 2.7780263182815486e-18)