# A Comparison of State Use of SLFRF Funds for Vaccination Programs and Vaccination Rates in Each State



### Data Sources:
CDC - "COVID-19 Vaccinations in the United States, Jurisdiction"
csv downloaded 5/11/23
https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-Jurisdi/unsk-b7fc

EARN/EPI - "EARN SLFRF Workbook for Q4 2022", compiled by Dave Kamper of the Economic Policy Institute (dkamper@epi.org) from Treasury reports filed by the states and local jurisdictions that received funding, and from other data sources as detailed in the workbook.

## Production Code (Team: Put your code here after it is complete and ready to go)

## Evan Work Area

In [25]:
### import dependencies and setup
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pprint import pprint
from pathlib import Path

In [26]:
### Suppress pandas' SettingWithCopyWarning (raised on chained assignment)
pd.options.mode.chained_assignment = None  # default='warn'
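
The option above silences the chained-assignment warning rather than removing its cause. A minimal sketch, on made-up toy data, of one common alternative: take an explicit `.copy()` of a filtered frame before assigning to it. The names `toy_df` and `subset_df` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy data standing in for the project table (all values are made up).
toy_df = pd.DataFrame({
    "Project Description": ["Vaccine clinic", "Road repair", "Vaccination outreach"],
    "Adopted Budget": ["1,000", "-", "2,500"],
})

# Filter, then take an explicit copy so later assignments modify an
# independent DataFrame rather than a possible view of toy_df.
mask = toy_df["Project Description"].str.lower().str.contains("vaccin")
subset_df = toy_df[mask].copy()

# Assigning to the copy leaves toy_df untouched.
subset_df["Adopted Budget"] = subset_df["Adopted Budget"].str.replace(",", "")

print(subset_df)
```

With the copy in place, the warning option could be left at its default so that genuine chained-assignment mistakes are still reported.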

In [27]:
### Load csv file(s)
all_states_sheet = Path("Resources/EARN_all_states.csv")


### Read csv file(s) as a DataFrame
all_states_df = pd.read_csv(all_states_sheet, skipinitialspace= True, low_memory=False)


### Strip stray whitespace from the column names before referencing them
all_states_df.columns = all_states_df.columns.str.strip()

### Report the size of the raw DataFrame
print(f"There are {len(all_states_df['Project ID'])} rows in the unfiltered DataFrame.")

#all_states_df

There are 35710 rows in the unfiltered DataFrame.


In [28]:
### Review the rows with NA values in the 'Project Description' column
nan_values = all_states_df[all_states_df['Project Description'].isna()]

# print(len(nan_values))
print(f'There are {len(nan_values)} rows with NA values in the "Project Description" column:')

#nan_values

There are 4 rows with NA values in the "Project Description" column:


In [29]:
### Drop these rows where the column has NaN value
    # source: https://towardsdatascience.com/how-to-drop-rows-in-pandas-dataframes-with-nan-values-in-certain-columns-7613ad1a7f25
    
all_states_df = all_states_df.dropna(subset=['Project Description'], how='all')

### confirm the NaN rows were dropped by reviewing column length count:

print(f'The DataFrame now has {len(all_states_df["Project ID"])} rows of data:')
all_states_df.head(1)


The DataFrame now has 35706 rows of data:


Unnamed: 0,Project ID,Recipient-ID,Recipient Name,State/Territory,StateList,Reporting Tier,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Community benefit agreement? (Infrastructure Only),Complying with David Bacon? (Infrastructure Only),Project labor agreement? (Infrastructure Only),Primary Demographic Served (Select Expenditure Categories Only)
0,TPN-039343,RCP-039196,"Woodbury County, Iowa",Iowa,Iowa,Tier 2. Metropolitan cities and counties with...,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,HVAC to mitigate covid,-,-,-,,,,1 Imp General Public


In [30]:
### Make the Project Description values all lowercase for value search:
all_states_df['Project Description'] = all_states_df['Project Description'].str.lower()

print(f'The Project Description column has been set to lowercase for all string values:')
all_states_df.head(2)

The Project Description column has been set to lowercase for all string values:


Unnamed: 0,Project ID,Recipient-ID,Recipient Name,State/Territory,StateList,Reporting Tier,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Community benefit agreement? (Infrastructure Only),Complying with David Bacon? (Infrastructure Only),Project labor agreement? (Infrastructure Only),Primary Demographic Served (Select Expenditure Categories Only)
0,TPN-039343,RCP-039196,"Woodbury County, Iowa",Iowa,Iowa,Tier 2. Metropolitan cities and counties with...,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,-,-,-,,,,1 Imp General Public
1,TPN-039461,RCP-036070,"Lexington-Fayette Urban County, Kentucky",Kentucky,Kentucky,"Tier 1. States, U.S. territories, metropolitan...",Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,-,-,-,,,,


In [31]:
### Brainstorm a list of words to filter the 'Project Description' column by.
    # this list will be used to filter that column so that we are only working with projects that
    # are actually vaccine related.

search_term_list = ['immunize', 'immunization', 'access to vaccines', 'spikevax', 'bivalent', 'novavax', 'two-dose',
                    'single-dose', 'emergency use authorization', 'vaccine coverage',
                    'vaccine access', 'vaccine distribution', 'distribute vaccines', 'vaccine', 'vaccination',
                    'vaccinate', 'moderna', 'pfizer', 'johnson & johnson', 'janssen']

#print(search_term_list)

In [32]:
### Filter the dataframe column 'Project Description'
    ## source: https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p

    
covid_projects_df = all_states_df[all_states_df['Project Description'].str.contains('|'.join(search_term_list))]


# print(len(all_states_df['Project Description']))
print(f'The number of rows containing vaccine search criteria terms is {len(covid_projects_df["Project ID"])}')
covid_projects_df.head(2)


The number of rows containing vaccine search criteria terms is 1095


Unnamed: 0,Project ID,Recipient-ID,Recipient Name,State/Territory,StateList,Reporting Tier,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Community benefit agreement? (Infrastructure Only),Complying with David Bacon? (Infrastructure Only),Project labor agreement? (Infrastructure Only),Primary Demographic Served (Select Expenditure Categories Only)
46,TPN-072775,RCP-036988,"Highlands County, Florida",Florida,Florida,Tier 2. Metropolitan cities and counties with...,Local Government,Cancelled,Hurricane Shelter,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,large facility designed to shelter special nee...,-,-,-,,,,1 Imp General Public
78,TPN-065542,RCP-036805,"Thurston County, Washington",Washington,Washington,"Tier 1. States, U.S. territories, metropolitan...",Local Government,Cancelled,COVID-19 Vaccination Incentive Program,1-Public Health,1.1-COVID-19 Vaccination,"as of july 31, 2021, 49.6 percent of the thurs...",-,-,-,,,,1 Imp General Public
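
The filter above joins the search terms into a single regular expression. A hedged sketch, on made-up descriptions, of two safeguards that may be worth considering: `re.escape`, so that any punctuation in a term is matched literally, and `na=False`, so that missing descriptions count as non-matches. The `terms` list is a hypothetical subset of `search_term_list`.

```python
import re
import pandas as pd

# Hypothetical subset of the notebook's search terms.
terms = ["vaccine", "johnson & johnson", "two-dose"]

# re.escape protects any regex metacharacters a term might contain.
pattern = "|".join(re.escape(term) for term in terms)

# Made-up descriptions, including one missing value.
descriptions = pd.Series([
    "mobile vaccine clinic",
    "hvac to mitigate covid",
    None,
    "two-dose series outreach",
])

# na=False treats missing descriptions as non-matches instead of
# propagating NaN into the boolean mask.
matches = descriptions.str.lower().str.contains(pattern, na=False)
print(matches.tolist())  # [True, False, False, True]
```

Substring matching is broad by design: the preview above includes a "Hurricane Shelter" project, so a manual review of the matched rows may still be needed.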


In [33]:
### Now format all budget related columns as integers for summing in the .groupby step:

# print(all_states_df.dtypes)
# print(f'\n----------------------------\n')
# print(covid_projects_df.dtypes)

In [34]:
### Clean up values that prevent conversion to a numeric dtype
numeric_cols = ['Adopted Budget',
       'Total Cumulative Obligations', 'Total Cumulative Expenditures']

### Remove dash placeholders, spaces, and thousands separators from the budget columns only
covid_projects_df[numeric_cols] = covid_projects_df[numeric_cols].replace(['-', ' ', ','], '', regex=True)

### Convert the budget columns to numeric (float) so they can be summed in the groupby step:
covid_projects_df[numeric_cols] = covid_projects_df[numeric_cols].apply(pd.to_numeric)

### Confirm monetary columns are float/int datatypes:
# print(covid_projects_df['Adopted Budget'].unique())
print(covid_projects_df.dtypes)
#covid_projects_df.head(3)


Project ID                                                          object
Recipient-ID                                                        object
Recipient Name                                                      object
State/Territory                                                     object
StateList                                                           object
Reporting Tier                                                      object
Recipient Type                                                      object
Completion Status                                                   object
Project Name                                                        object
Expenditure Category Group                                          object
Expenditure Category                                                object
Project Description                                                 object
Adopted Budget                                                     float64
Total Cumulative Obligati
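
The same clean-then-convert steps are repeated for several DataFrames in this notebook. A minimal sketch of how they could be wrapped in one helper; `to_dollars` is a hypothetical name, and unlike the cell above it keeps a leading minus sign, treating only non-numeric leftovers (such as a lone dash) as missing.

```python
import pandas as pd

def to_dollars(series: pd.Series) -> pd.Series:
    """Convert strings such as '$1,234.50' or '-' to floats (NaN when not numeric)."""
    # Drop dollar signs, thousands separators, and whitespace.
    cleaned = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    # errors="coerce" turns anything that still is not a number into NaN.
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up raw values in the styles seen in the source sheets.
raw = pd.Series(["$1,234.50", "-", " 2,000 ", None])
result = to_dollars(raw)
print(result.tolist())
```

A helper like this could be applied column by column, for example `df[numeric_cols].apply(to_dollars)`.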

In [35]:
### Group the filtered DataFrame by state, summing the dollar-value columns.
    # The budget columns must already be numeric: placeholder values such as "-"
    # in 'Adopted Budget' would otherwise prevent .sum() from working.

covid_sums_df = covid_projects_df.groupby('State/Territory', as_index=False)[numeric_cols].sum()

print(f'The column headers for covid_sums_df are:\n\n {covid_sums_df.columns}')
covid_sums_df.head(1)

The column headers for covid_sums_df are:

 Index(['State/Territory', 'Adopted Budget', 'Total Cumulative Obligations',
       'Total Cumulative Expenditures'],
      dtype='object')


Unnamed: 0,State/Territory,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
0,Alabama,2900561.02,4714001.87,2736680.11


In [36]:
### Add column of state name abbreviations:
    # source: https://gist.github.com/rogerallen/1583593

us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "Virgin Islands": "VI",
}
    
### add abbreviated state name column and reorder so the abbrev is after full state name column:
covid_sums_df['Location'] = covid_sums_df['State/Territory'].map(us_state_to_abbrev)
covid_sums_df = covid_sums_df[['State/Territory', 'Location', 'Adopted Budget', 
                                       'Total Cumulative Obligations', 'Total Cumulative Expenditures']]

covid_sums_df.head()


Unnamed: 0,State/Territory,Location,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
0,Alabama,AL,2900561.0,4714002.0,2736680.0
1,American Samoa,AS,25505250.0,24130120.0,22780120.0
2,Arizona,AZ,86048350.0,53824130.0,51529320.0
3,Arkansas,AR,2332340.0,2515517.0,1719264.0
4,California,CA,733404000.0,647083200.0,574407800.0
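
`Series.map` returns NaN for any name that is missing from the dictionary, so a quick check for unmapped names can catch spelling differences between data sources. A small sketch with a shortened, hypothetical lookup table and made-up input spellings:

```python
import pandas as pd

# Shortened, hypothetical lookup table; the full dictionary works the same way.
name_to_abbrev = {"Alabama": "AL", "Virgin Islands": "VI"}

# Made-up inputs: one alternate spelling and one trailing space.
states = pd.Series(["Alabama", "U.S. Virgin Islands", "Alabama "])

abbrevs = states.map(name_to_abbrev)

# Names missing from the dictionary (or spelled differently) map to NaN.
unmapped = states[abbrevs.isna()].unique().tolist()
print(unmapped)  # ['U.S. Virgin Islands', 'Alabama ']
```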


In [37]:
### Group covid_projects_df by state and count 'Project ID' values;
### the counts are merged onto the spending sums in the next cell.

covid_counts_df = covid_projects_df.groupby(['State/Territory'], as_index=False).count()[['State/Territory', 'Project ID']]

covid_counts_df.head()


Unnamed: 0,State/Territory,Project ID
0,Alabama,9
1,American Samoa,3
2,Arizona,20
3,Arkansas,12
4,California,111


In [39]:
### Merge the per-state vaccine project counts onto covid_sums_df to create state_spending_df:

state_spending_df = pd.merge(covid_sums_df, covid_counts_df, how='inner', on='State/Territory')


In [40]:
### Rename the counted 'Project ID' column for clarity:
state_spending_df.rename(columns = {'Project ID':'Count of Vaccine Projects'}, inplace = True)

state_spending_df

Unnamed: 0,State/Territory,Location,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Count of Vaccine Projects
0,Alabama,AL,2900561.0,4714002.0,2736680.0,9
1,American Samoa,AS,25505250.0,24130120.0,22780120.0,3
2,Arizona,AZ,86048350.0,53824130.0,51529320.0,20
3,Arkansas,AR,2332340.0,2515517.0,1719264.0,12
4,California,CA,733404000.0,647083200.0,574407800.0,111
5,Colorado,CO,198222500.0,171675700.0,130964500.0,59
6,Connecticut,CT,10684000.0,4957381.0,2758473.0,20
7,Delaware,DE,11010770.0,1674293.0,1123534.0,8
8,District of Columbia,DC,18741330.0,14212780.0,14060350.0,5
9,Florida,FL,39264150.0,48853800.0,37619770.0,55
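
The sums and the project counts above are built in separate groupby steps and then merged. As a sketch of an alternative, pandas named aggregation can produce both in one pass. The data and the output column names (`adopted_budget`, `expenditures`, `project_count`) are made up for illustration.

```python
import pandas as pd

# Toy project rows with made-up IDs and amounts.
projects = pd.DataFrame({
    "State/Territory": ["Alabama", "Alabama", "Arizona"],
    "Project ID": ["TPN-1", "TPN-2", "TPN-3"],
    "Adopted Budget": [100.0, 250.0, 75.0],
    "Total Cumulative Expenditures": [80.0, None, 75.0],
})

# One groupby: sum the dollar columns and count the projects per state.
summary = projects.groupby("State/Territory", as_index=False).agg(
    adopted_budget=("Adopted Budget", "sum"),
    expenditures=("Total Cumulative Expenditures", "sum"),
    project_count=("Project ID", "count"),
)
print(summary)
```

Missing amounts are skipped by `sum`, so a state with some blank expenditures still gets a total from the rows that have values.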


In [41]:
### "all_us_projects_df" is for (2) from Joanna's slack message request:
all_us_projects_df = all_states_df[['Recipient Name', 'State/Territory', 'Recipient Type', 
                                    'Completion Status', 'Project Name', 'Expenditure Category Group', 'Expenditure Category', 
                                    'Project Description', 'Adopted Budget', 'Total Cumulative Obligations', 
                                    'Total Cumulative Expenditures']].copy()


all_us_projects_df['State/Territory'] = all_us_projects_df['State/Territory'].map(us_state_to_abbrev)
all_us_projects_df.rename(columns = {'State/Territory':'State'}, inplace = True)

all_us_projects_df[['Adopted Budget','Total Cumulative Obligations',
                   'Total Cumulative Expenditures']] = all_us_projects_df[['Adopted Budget',
       'Total Cumulative Obligations', 'Total Cumulative Expenditures']].replace(['-', ' '] ,'', regex=True)


numeric_cols = ['Adopted Budget',
       'Total Cumulative Obligations', 'Total Cumulative Expenditures']


### Convert budget columns to numeric (float) for summarizing in groupby:
all_us_projects_df = all_us_projects_df.replace(',','', regex=True)
all_us_projects_df[numeric_cols] = all_us_projects_df[numeric_cols].apply(pd.to_numeric)

# all_us_projects_df.dtypes
all_us_projects_df.head(3)

Unnamed: 0,Recipient Name,State,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
0,Woodbury County Iowa,IA,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,,,
1,Lexington-Fayette Urban County Kentucky,KY,Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,,,
2,Lexington-Fayette Urban County Kentucky,KY,Local Government,Cancelled,Non-Profit Capital Grants,6-Revenue Replacement,6.1-Provision of Government Services,the nonprofit capital project grants program i...,,,


In [42]:
### "us_covid_projects_df" is for (3) from Joanna's slack message:
us_covid_projects_df = all_us_projects_df[all_us_projects_df['Project Description'].str.contains('|'.join(search_term_list))]


# print(len(all_states_df['Project Description']))
print(f'The number of rows containing covid/vaccine search criteria terms is {len(us_covid_projects_df["Project Name"])}')
us_covid_projects_df.head()

The number of rows containing covid/vaccine search criteria terms is 1095


Unnamed: 0,Recipient Name,State,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
46,Highlands County Florida,FL,Local Government,Cancelled,Hurricane Shelter,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,large facility designed to shelter special nee...,,,
78,Thurston County Washington,WA,Local Government,Cancelled,COVID-19 Vaccination Incentive Program,1-Public Health,1.1-COVID-19 Vaccination,as of july 31 2021 49.6 percent of the thursto...,,,
121,Larimer County Colorado,CO,Local Government,Cancelled,Community Health Mapping/Dashboard,1-Public Health,1.14-Other Public Health Services,larimer county is embarking on a project to co...,,,
193,Hoboken City New Jersey,NJ,Local Government,Cancelled,Covid-19 Vaccination Staffing,1-Public Health,1.1-COVID-19 Vaccination,contract staffing for the city's covid-19 vacc...,,,
213,State Of New Hampshire,NH,State/DC,Cancelled,RPHN Clinics-Vaccine Administration,1-Public Health,1.1-COVID-19 Vaccination,request to utilize american rescue plan act (a...,,,


## Aaliyah Work Area

In [43]:
### import and read the state_summary.csv
### Load csv file(s)
state_summary_sheet = Path("Resources/state_summary.csv")


### Read csv file(s) as a DataFrame
state_summary_df = pd.read_csv(state_summary_sheet, skipinitialspace= True)


state_summary_df.head()

Unnamed: 0,State,Total state allocation (from the fed),total state plus total local federal grant,Total state spending,"Spent as of Sept 30, 2022",Total state obligated,Total state budgeted,Share of state allocation spent,Share of state allocation obligated,Share of state allocation budgeted,...,Share of local spent,Share of local obligated,Share of local budgeted,Share of state + local spent,Change in state spending since Sept (as share of total allocation),Change in local spending since Sept,Change in local government employment (inclusing public education) from Feb 2020 to Jan 2023,"Percentage change in local government employment, February 2020-Jan 2023","Change in state government jobs, Feb 2020 to Jan 2023 (thousands","Percentage change in state government jobs, Feb 2020 to Jan 2023"
0,Alabama,"$2,120,279,417","$3,287,582,722","$348,913,764","$340,112,472","$350,199,320","$1,060,139,709",16.5%,16.5%,50.0%,...,20.5%,35.0%,23.6%,18%,0.42%,4.1%,0.4,0.18%,0.3,0.25%
1,Alaska,"$1,011,788,220","$1,166,360,017","$865,562,003","$805,280,930","$884,653,257","$1,001,201,989",85.5%,87.4%,99.0%,...,62.5%,70.4%,78.0%,82%,5.96%,31.2%,-1.7,-4.10%,-0.5,-2.20%
2,Arizona,"$4,182,827,492","$6,621,288,758","$2,120,555,074","$1,923,020,697","$2,496,788,343","$2,792,726,506",50.7%,59.7%,66.8%,...,30.6%,43.3%,76.1%,43%,4.72%,3.3%,-12.0,-4.34%,0.0,0.00%
3,Arkansas,"$1,573,121,581","$2,112,900,112","$616,773,435","$546,907,964","$660,527,986","$767,344,936",39.2%,42.0%,48.8%,...,32.4%,49.3%,30.9%,37%,4.44%,8.1%,-2.9,-2.53%,-2.1,-2.68%
4,California,"$27,017,016,860","$41,419,307,889","$20,188,839,813","$19,629,506,051","$24,826,648,677","$26,933,816,205",74.7%,91.9%,99.7%,...,37.1%,46.4%,67.2%,62%,2.07%,4.3%,-60.2,-3.29%,7.4,1.37%


In [44]:
### create a reduced dataframe from the state_summary_df columns: 
    #'State', 'Total state allocation (from the fed)', 'total state plus total local federal grant', 
    #'Share of state allocation spent', 'Share of state allocation obligated', 'Share of state allocation budgeted', 
    #'Total local allocation (from the fed)', 'Share of local spent', 'Share of local obligated', 'Share of local budgeted', 
    #'Share of state + local spent'

### Select the desired columns from state_summary_df (loaded in the previous cell).
### .copy() makes the assignments below act on an independent DataFrame.
reduced_df = state_summary_df[['State', 'Total state allocation (from the fed)',
                               'total state plus total local federal grant',
                               'Share of state allocation spent', 'Share of state allocation obligated',
                               'Share of state allocation budgeted', 'Total local allocation (from the fed)',
                               'Share of local spent', 'Share of local obligated', 'Share of local budgeted',
                               'Share of state + local spent']].copy()

### Replace underscores in state names with spaces so they match the abbreviation dictionary
reduced_df['State'] = reduced_df['State'].replace('_',' ', regex=True)

### Add an abbreviated state name column:
reduced_df['Location'] = reduced_df['State'].map(us_state_to_abbrev)

### Rename 'State' so it matches the key column in state_spending_df
reduced_df.rename(columns = {'State':'State/Territory'}, inplace = True)

### Printing the reduced dataframe
# print(reduced_df.columns)
# print(reduced_df.dtypes)



In [45]:
### Convert the allocation columns to numeric values.
numeric_cols = ['Total state allocation (from the fed)', 
            'total state plus total local federal grant',
            'Total local allocation (from the fed)']

### Drop non-numeric characters first ($ signs, dashes, spaces).
### The raw string r'\$' avoids an invalid-escape warning in newer Python versions.
reduced_df[numeric_cols] = reduced_df[numeric_cols].replace([r'\$', '-', ' '], '', regex=True)

### Remove thousands separators, then convert the allocation columns to numeric (float):
reduced_df = reduced_df.replace(',','', regex=True)
reduced_df[numeric_cols] = reduced_df[numeric_cols].apply(pd.to_numeric)

# reduced_df.dtypes
print(reduced_df.dtypes)


State/Territory                                object
Total state allocation (from the fed)         float64
total state plus total local federal grant    float64
Share of state allocation spent                object
Share of state allocation obligated            object
Share of state allocation budgeted             object
Total local allocation (from the fed)         float64
Share of local spent                           object
Share of local obligated                       object
Share of local budgeted                        object
Share of state + local spent                   object
Location                                       object
dtype: object


In [46]:
### Now replace percentage string values with a decimal float value dtype:
    # reduced_df[['Share of state allocation spent', 'Share of state allocation obligated', 'Share of state allocation budgeted']] = reduced_df[['Share of state allocation spent', 'Share of state allocation obligated', 'Share of state allocation budgeted']].str.rstrip('%').astype('float') / 100.0

convert_cols = ['Share of state allocation spent', 'Share of state allocation obligated', 'Share of state allocation budgeted', 'Share of local spent', 'Share of local obligated', 'Share of local budgeted', 'Share of state + local spent']

reduced_df = reduced_df.replace('%','', regex=True)

reduced_df[convert_cols] = reduced_df[convert_cols].astype(float)/100


# print(reduced_df.dtypes)
reduced_df.head(2)


Unnamed: 0,State/Territory,Total state allocation (from the fed),total state plus total local federal grant,Share of state allocation spent,Share of state allocation obligated,Share of state allocation budgeted,Total local allocation (from the fed),Share of local spent,Share of local obligated,Share of local budgeted,Share of state + local spent,Location
0,Alabama,2120279000.0,3287583000.0,0.165,0.165,0.5,1167303000.0,0.205,0.35,0.236,0.18,AL
1,Alaska,1011788000.0,1166360000.0,0.855,0.874,0.99,154571800.0,0.625,0.704,0.78,0.82,AK
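
The percentage conversion can also be done column by column without touching the rest of the DataFrame. A minimal sketch on made-up rows, using two of the share columns as examples:

```python
import pandas as pd

# Two sample share columns with made-up rows.
shares = pd.DataFrame({
    "Share of state allocation spent": ["16.5%", "85.5%"],
    "Share of state + local spent": ["18%", "82%"],
})

# Strip the trailing % sign from each column, then scale to a 0-1 fraction.
fractions = shares.apply(lambda col: pd.to_numeric(col.str.rstrip("%")) / 100)
print(fractions)
```

Restricting the operation to the share columns avoids removing a literal `%` from any unrelated text column.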


## Evan Work Area 2

In [47]:
### merge this data frame with Evan's "state_spending_df". Merge on the state columns.
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
    # https://www.geeksforgeeks.org/how-to-join-pandas-dataframes-using-merge/#

### EARN_states combines the three budget columns from "All_US_Projects" sheet with the entire "State Summary Table" sheet.
### The three budget columns are filtered for covid projects, but all dollar value columns in the "State Summary Table" are not filtered by covid projects.

EARN_states = pd.merge(state_spending_df, reduced_df, how ='inner', on =(['State/Territory', 'Location']))

EARN_states.head(3)

Unnamed: 0,State/Territory,Location,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Count of Vaccine Projects,Total state allocation (from the fed),total state plus total local federal grant,Share of state allocation spent,Share of state allocation obligated,Share of state allocation budgeted,Total local allocation (from the fed),Share of local spent,Share of local obligated,Share of local budgeted,Share of state + local spent
0,Alabama,AL,2900561.02,4714001.87,2736680.11,9,2120279000.0,3287583000.0,0.165,0.165,0.5,1167303000.0,0.205,0.35,0.236,0.18
1,Arizona,AZ,86048345.01,53824126.47,51529318.18,20,4182827000.0,6621289000.0,0.507,0.597,0.668,2438461000.0,0.306,0.433,0.761,0.43
2,Arkansas,AR,2332340.14,2515517.14,1719263.84,12,1573122000.0,2112900000.0,0.392,0.42,0.488,539778500.0,0.324,0.493,0.309,0.37
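
An inner merge silently drops rows whose key appears in only one table (American Samoa is in `state_spending_df` but not in the merged preview above). A hedged sketch of how an outer merge with `indicator=True` can list the dropped keys. The frames and values here are made up.

```python
import pandas as pd

# Made-up stand-ins for the two tables being merged.
spending = pd.DataFrame({
    "State/Territory": ["Alabama", "American Samoa", "Arizona"],
    "Total Cumulative Expenditures": [1.0, 2.0, 3.0],
})
summary = pd.DataFrame({
    "State/Territory": ["Alabama", "Alaska", "Arizona"],
    "total state plus total local federal grant": [10.0, 20.0, 30.0],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
check = spending.merge(summary, how="outer", on="State/Territory", indicator=True)

# Rows not labelled 'both' are the ones an inner merge would drop.
dropped = check.loc[check["_merge"] != "both", ["State/Territory", "_merge"]]
print(dropped)
```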


In [48]:
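### Add a column showing the share of federal money spent on vaccine projects per state:
### [Total Cumulative Expenditures] / [total state plus total local federal grant]
### Both columns come from EARN_states so each state's spending is divided by its own grant;
### state_spending_df has a different index after the inner merge, so its rows would misalign.
### The result is a fraction (0-1). Sort by this new column.

EARN_states['Percent Spent on Covid Projects'] = EARN_states['Total Cumulative Expenditures']/EARN_states['total state plus total local federal grant']

EARN_states.sort_values(by=['Percent Spent on Covid Projects'], ascending=False, inplace= True)

EARN_states.head(6)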

Unnamed: 0,State/Territory,Location,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Count of Vaccine Projects,Total state allocation (from the fed),total state plus total local federal grant,Share of state allocation spent,Share of state allocation obligated,Share of state allocation budgeted,Total local allocation (from the fed),Share of local spent,Share of local obligated,Share of local budgeted,Share of state + local spent,Percent Spent on Covid Projects
33,North Dakota,ND,55000.0,415691.0,415691.0,3,1007503000.0,1119686000.0,0.307,0.938,0.938,112183500.0,0.214,0.372,0.381,0.3,0.125416
4,Colorado,CO,198222500.0,171675700.0,130964500.0,59,3828762000.0,5349708000.0,0.242,0.365,0.821,1520946000.0,0.328,0.44,0.69,0.27,0.107372
20,Massachusetts,MA,199343800.0,205910600.0,185962200.0,53,5286068000.0,7872009000.0,0.401,0.429,0.524,2585941000.0,0.235,0.391,0.405,0.35,0.065739
23,Mississippi,MS,5000.0,5000.0,5000.0,1,1806373000.0,2185260000.0,0.038,0.05,0.65,378886300.0,0.162,0.264,0.256,0.06,0.042669
30,New Mexico,NM,24538700.0,23936180.0,22309890.0,8,1751543000.0,2245396000.0,0.402,0.458,0.804,493852700.0,0.317,0.574,0.756,0.38,0.039973
5,Connecticut,CT,10684000.0,4957381.0,2758473.0,20,2812288000.0,3783574000.0,0.228,0.275,0.607,971285600.0,0.24,0.436,0.508,0.23,0.034614
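
When a ratio's numerator and denominator come from different DataFrames, pandas pairs rows by index label, not by position or by state name. In the table above, Colorado's 0.107372 appears to equal California's expenditures (574,407,800) divided by Colorado's grant (5,349,708,000), while Colorado's own expenditures would give roughly 0.024. A minimal sketch, with made-up numbers, of how an inner merge that renumbers rows can produce this kind of mismatch:

```python
import pandas as pd

# Made-up numbers; the index labels are what matter here.
# before_merge holds rows A, B, C; after_merge kept only rows A and C.
before_merge = pd.DataFrame({"spent": [10.0, 20.0, 30.0]}, index=[0, 1, 2])
after_merge = pd.DataFrame({"spent": [10.0, 30.0], "grant": [100.0, 300.0]}, index=[0, 1])

# Division aligns on index labels, so row C's grant (label 1) is paired
# with row B's spending (also label 1), and label 2 has no grant at all.
misaligned = before_merge["spent"] / after_merge["grant"]

# Taking both columns from the same frame keeps each row's own values together.
aligned = after_merge["spent"] / after_merge["grant"]

print(misaligned.tolist())
print(aligned.tolist())
```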


In [49]:
# Now do combined bar and line chart. Line shows [total state plus total local federal grant]
# bar shows [Percent Spent]
# x-axis is state name
# Use this method: https://towardsdatascience.com/creating-a-dual-axis-combo-chart-in-python-52624b187834

# x_label = EARN_states['State/Territory']


In [50]:
### Update this figure using the covid_projects_df values as those can be filtered by project description.
# Create figure and axis #1


# fig, ax1 = plt.subplots()
# x = EARN_states['State/Territory']

# # plot line chart on axis #1
# ax1.plot(x, EARN_states['total state plus total local federal grant']) 
# ax1.set_ylabel('Total Federal funding ($)')
# ax1.set_ylim(0, max(EARN_states['total state plus total local federal grant']))
# ax1.legend(['test_legend1'], loc="upper left")


# # set up the 2nd axis
# ax2 = ax1.twinx()
# # plot bar chart on axis #2
# ax2.bar(x, EARN_states['Percent Spent on Covid Projects'], width=0.5, alpha=0.5, color='orange')
# ax2.grid(False) # turn off grid #2
# ax2.set_ylabel('Percent Spent')
# ax2.set_ylim(0, 1)
# ax2.legend(['test_legend2'], loc="upper right")
# plt.show()

# print(max(EARN_states['total state plus total local federal grant']+500000))
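
A runnable sketch of the dual-axis combo chart outlined in the commented-out cell above, using made-up values for three states. The `share_spent` column name and the legend labels are placeholders, and the `Agg` backend line is only there so the sketch runs without a display; it can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for three states.
chart_df = pd.DataFrame({
    "State/Territory": ["Alabama", "Arizona", "Arkansas"],
    "total state plus total local federal grant": [3.3e9, 6.6e9, 2.1e9],
    "share_spent": [0.0008, 0.0078, 0.0008],
})

fig, ax1 = plt.subplots()
x = chart_df["State/Territory"]

# Line on the left axis: total federal funding.
ax1.plot(x, chart_df["total state plus total local federal grant"],
         label="Total federal funding")
ax1.set_ylabel("Total federal funding ($)")
ax1.legend(loc="upper left")

# Bars on a second y-axis that shares the same x-axis.
ax2 = ax1.twinx()
ax2.bar(x, chart_df["share_spent"], width=0.5, alpha=0.5,
        color="orange", label="Share spent")
ax2.set_ylabel("Share spent")
ax2.grid(False)
ax2.legend(loc="upper right")

fig.tight_layout()
```

With around fifty states on the x-axis, rotating the tick labels (for example `ax1.tick_params(axis="x", rotation=90)`) would probably be needed for legibility.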



## Data Exploration and Cleanup:
- Describe here the group's data sets and how they were cleaned for analysis

## Greg Work Area

### CDC Data

In [None]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
import scipy.stats as st
import seaborn as sns

In [None]:
#Import vaccination data from csv
vac_df = pd.read_csv('COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv')

In [None]:
#function formats the CDC dataframe for US jurisdictions - see the next cell for example inputs
def CDC_format(df,key,date,add_str,drop_values): #key, date, add_str -> str; drop_values -> list
    df = df.dropna(subset=[key])
    df = df[df['Date'] == date] #filter the frame passed in, not the global vac_df
    df = df[~df[key].isin(drop_values)].copy()
    df[key] = add_str + df[key].astype(str) #prefix each location code, e.g. 'AK' -> 'US_AK'
    df.reset_index(drop = True, inplace = True)
    return df

In [None]:
drop = ['DD2','FM','AS','VI','BP2','IH2','GU','PN','PR','VA2','PW','US','MP','MH']
vac_df = CDC_format(vac_df,'Location','12/28/2022','US_',drop)

In [None]:
vac_df

In [None]:
### Google vac data

In [None]:
#Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
import scipy.stats as st
import seaborn as sns
import cartopy.crs as ccrs
import geoviews as gv # noqa
import pyproj
import geopandas as gpd
import hvplot.pandas
import plotly.express as px

In [None]:
#Import vaccination data from google api
vac_df = pd.read_csv('https://storage.googleapis.com/covid19-open-data/v3/latest/vaccinations.csv')

In [None]:
#function formats the google dataframes - see below for input formats
def google_format(df,key,filt,length,columns,drop_values): #key, filt -> str; length -> int; columns, drop_values -> list
    df = df.dropna(subset=[key])
    df = df[df[key].str.contains(filt)]
    mask = (df[key].str.len() == length)
    df = df.loc[mask]
    df = df[columns]
    df = df[~df[key].isin(drop_values)]
    df.reset_index(drop = True, inplace = True)
    return df

In [None]:
#Input values for vaccination data
drop = ['US_AS','US_GU','US_MP','US_PR','US_VI']
cols = ['date','location_key','cumulative_persons_fully_vaccinated','new_persons_vaccinated','new_persons_fully_vaccinated']
loc_key = 'location_key'
contains = 'US_'

In [None]:
#formatting vaccination data
vac_df = google_format(vac_df, loc_key, contains, 5, cols, drop)
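
`google_format` keeps state-level rows by combining a substring test (`'US_'`) with a length test (5 characters). As a sketch of an alternative, a single anchored pattern can express both conditions. The sample keys below are made up in the style of the dataset's `location_key` values.

```python
import pandas as pd

# Made-up keys: country, state, county-style, territory, foreign region, missing.
keys = pd.Series(["US", "US_AK", "US_AK_02013", "US_PR", "AU_NSW", None])
territories = ["US_AS", "US_GU", "US_MP", "US_PR", "US_VI"]

# fullmatch requires the whole key to be 'US_' plus exactly two capital letters,
# which covers the prefix test and the length test in one pattern.
is_state_level = keys.str.fullmatch(r"US_[A-Z]{2}", na=False)

# Exclude the territories, as the drop list does in the cells above.
state_keys = keys[is_state_level & ~keys.isin(territories)]
print(state_keys.tolist())  # ['US_AK']
```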

In [None]:
mylist = ['Orange','Apple'] #Keywords search
pattern = '|'.join(mylist)
vac_df.location_key.str.contains(pattern)

In [None]:
#reading demographic data
dem_df = pd.read_csv('demographics.csv')

In [None]:
dem_df

In [None]:
dcols = ['location_key','population']

In [None]:
#formatting demographic data
dem_df = google_format(dem_df, loc_key, contains, 5, dcols, drop)

In [None]:
#reading epidemiology data
epi_df = pd.read_csv('https://storage.googleapis.com/covid19-open-data/v3/latest/epidemiology.csv')

In [None]:
ecols = ['location_key','cumulative_confirmed','cumulative_deceased','cumulative_recovered']

In [None]:
#formatting epidemiology data
epi_df = google_format(epi_df, loc_key, contains, 5, ecols, drop)

In [None]:
loc_key = pd.read_csv('https://storage.googleapis.com/covid19-open-data/v3/location/US.csv')

In [None]:
#US_vac_df is not defined in this notebook; use the formatted vac_df instead
AK_vac_df = vac_df[vac_df['location_key'].str.contains('US_AK')]

In [None]:
#Looking at only one state - this can be skipped
AK_total = AK_vac_df['cumulative_persons_fully_vaccinated'].iloc[1:len(AK_vac_df)].sum()
AK_total

In [None]:
#we don't need this at the moment, can be skipped
def swap_rows(df, i1, i2): #Keep this!!!
    a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy()
    df.iloc[i1, :], df.iloc[i2, :] = b, a
    return df

In [None]:
#merging dataframes
total_df = vac_df.merge(dem_df, how = 'inner',on = 'location_key')

In [None]:

total_df['percent_fully_vaccinated'] = (total_df['cumulative_persons_fully_vaccinated']/total_df['population'])*100
total_df.sort_values('percent_fully_vaccinated', ascending = False)

In [None]:
#merging dataframes
total_df = total_df.merge(epi_df, how = 'inner',on = 'location_key')

In [None]:
total_df['percent_death_rate_by_case'] = (total_df['cumulative_deceased']/total_df['cumulative_confirmed'])*100

In [None]:
total_df['percent_death_rate_per_capita'] = (total_df['cumulative_deceased']/total_df['population'])*100

In [None]:
total_df['percent_confirmed'] = (total_df['cumulative_confirmed']/total_df['population'])*100

In [None]:
total_df['state_code'] = total_df.location_key.str.replace('US_','') #adding the state code for the plotly function

In [None]:
total_df.sort_values('percent_fully_vaccinated', ascending = False)

In [None]:
#function for regression plots
def reg(df,x,y,x_text,y_text):    
    lm = st.linregress(x = df[x], y = df[y])
    data_fit = lm[0]*df[x] + lm[1]
    fit_df = pd.DataFrame({'x': df[x], 'fitted': data_fit})
    ax = sns.scatterplot(data = df, x = x, y = y)
    #ax = df.plot.scatter(y = y, x = x, s = 30)
    print(f"The r-value is: {lm[2]}")
    fit_df.plot.line(x = 'x', y = 'fitted', color = 'red', ax=ax, legend = None, xlabel = x)
    plt.text(x_text,y_text,f"y = {'%.2f' %lm[0]}x + {'%.1f' %lm[1]}", color = 'red', fontsize = 16)
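
The `reg` function reads the slope, intercept and r-value from the `linregress` result by position (`lm[0]`, `lm[1]`, `lm[2]`). A minimal sketch on made-up, perfectly linear data shows the same values through the named attributes, which may be easier to read:

```python
import numpy as np
from scipy import stats as st

# Made-up data lying exactly on y = 2x + 1 (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

lm = st.linregress(x, y)
fitted = lm.slope * x + lm.intercept   # same as lm[0] * x + lm[1]
print(f"slope={lm.slope:.2f}, intercept={lm.intercept:.2f}, r={lm.rvalue:.2f}")
```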

In [None]:
reg(total_df,'percent_fully_vaccinated','percent_death_rate_by_case',50,0.6)

In [None]:
reg(total_df,'percent_fully_vaccinated','percent_death_rate_per_capita',50,0.15)

In [None]:
reg(total_df,'percent_fully_vaccinated','percent_confirmed',50,20)

In [None]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

world.hvplot(c='country', geo=True)

In [None]:
#generating map of us states - you need to specify the color variable as one of the dataframe columns 
fig = px.choropleth(total_df,
                    locations='state_code', 
                    locationmode="USA-states", 
                    scope="usa",
                    color='percent_death_rate_per_capita',
                    color_continuous_scale="blues" 
                    )
# fig.add_scattergeo(
#     locations=total_df['state_code'],
#     locationmode="USA-states", 
#     text=total_df['state_code'],
#     mode='text',
# )
fig.show()

# Joanna Work Area

In [None]:
#putting Greg's code down here so I can run my area independently of the rest of the sheet without error
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from scipy.stats import linregress
import scipy.stats as st
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [None]:
#Import vaccination data from csv
vac_df = pd.read_csv('Resources/COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv')

In [None]:
#function formats the CDC dataframe for US jurisdictions - see below for input formats
def CDC_format(df,key,date,add_str,drop_values): #key, date, add_str -> str; drop_values -> list
    df = df.dropna(subset=[key])
    df = df[df['Date'] == date] #keep only rows for the requested date (uses df, not the global vac_df)
    df = df[~df[key].isin(drop_values)].copy() #drop unwanted jurisdictions
    df[key] = add_str + df[key].astype(str) #prefix each location code, e.g. 'AK' -> 'US_AK'
    df.reset_index(drop = True, inplace = True)
    return df

In [None]:
#drop non-state territories from dataframe, select only rows with 12/28/22 data
drop = ['DD2','FM','AS','VI','BP2','IH2','GU','PN','PR','VA2','PW','US','MP','MH']
vac_df = CDC_format(vac_df,'Location','12/28/2022','US_',drop)

In [None]:
#change location to match state code
vac_df['Location'] = vac_df['Location'].str.replace('US_', '')

## To do list
Work out the population figure the CDC uses for each state, then use it to calculate a Pop_Pct value for the Administered_Bivalent column

Compare Administered to Recip_Administered to see if there are any significant differences in any state

Make some smaller dataframes for viewing:

a) Whole pop with Distrib, Administered, Dose1, Series Complete, Additional Doses, Second Booster, Administered Bivalent

b) Each individual age group with Dose1, Series Complete, Additional Doses, Second Booster, Bivalent Booster

c) Each category (Dose1, Series Complete, Additional Doses, Second Booster, Bivalent Booster) with all age ranges

Identify which states deviate most from the mean (general/nationwide population) in % vaccinated, looking at all dosage categories and age categories. This will show us which states were the "good vaccinators" and which were the "poor vaccinators." We can then use the EARN data to see whether this correlates with how much of the federal money they spent, how many vaccination projects they ran, etc.
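
One possible way to flag states far from the mean is a z-score on a percentage column. The sketch below uses made-up values, placeholder state codes and a hypothetical cutoff of one standard deviation. Note that it uses the unweighted mean across states, which is not the same as the nationwide rate.

```python
import pandas as pd

# Made-up percentages and placeholder state codes (illustration only).
toy = pd.DataFrame({
    "Location": ["AA", "BB", "CC", "DD", "EE"],
    "Series_Complete_Pop_Pct": [50.0, 60.0, 70.0, 80.0, 90.0],
})

col = "Series_Complete_Pop_Pct"
mean = toy[col].mean()
std = toy[col].std()                      # sample standard deviation (ddof=1)
toy["z_score"] = (toy[col] - mean) / std  # distance from the mean in standard deviations

threshold = 1.0  # hypothetical cutoff, not a value chosen by the team
high = toy.loc[toy["z_score"] > threshold, "Location"].tolist()
low = toy.loc[toy["z_score"] < -threshold, "Location"].tolist()
print("well above the mean:", high)
print("well below the mean:", low)
```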


In [None]:
# get all the columns we will be interested in into one dataframe
# NOTE: there is no Pop_Pct column for the administered_bivalent, and second_booster only for the age breakouts
# but we can extrapolate from their other population calculations to calculate these. For second_booster to get state numbers
# we have to add up the vaccines from the different manufacturers because we don't have them already summed.

vac_df = pd.DataFrame(data=vac_df, columns=["Location", "Distributed", "Administered", "Recip_Administered",
                                                   "Administered_Dose1_Recip", "Administered_Dose1_Pop_Pct",
                                                   "Administered_Dose1_Recip_5Plus", "Administered_Dose1_Recip_5PlusPop_Pct",
                                                  "Administered_Dose1_Recip_12Plus", "Administered_Dose1_Recip_12PlusPop_Pct",
                                                  "Administered_Dose1_Recip_18Plus", "Administered_Dose1_Recip_18PlusPop_Pct",
                                                  "Administered_Dose1_Recip_65Plus", "Administered_Dose1_Recip_65PlusPop_Pct",
                                                  "Series_Complete_Yes", "Series_Complete_Pop_Pct", "Series_Complete_5Plus",
                                                "Series_Complete_5PlusPop_Pct", "Series_Complete_12Plus",
                                                "Series_Complete_12PlusPop_Pct", "Series_Complete_18Plus",
                                                "Series_Complete_18PlusPop_Pct", "Series_Complete_65Plus",
                                                "Series_Complete_65PlusPop_Pct", "Additional_Doses",
                                                   "Additional_Doses_Vax_Pct", "Additional_Doses_5Plus",
                                                   "Additional_Doses_5Plus_Vax_Pct", "Additional_Doses_12Plus",
                                                   "Additional_Doses_12Plus_Vax_Pct", "Additional_Doses_18Plus",
                                                   "Additional_Doses_18Plus_Vax_Pct", "Additional_Doses_50Plus",
                                                   "Additional_Doses_50Plus_Vax_Pct", "Additional_Doses_65Plus",
                                                   "Additional_Doses_65Plus_Vax_Pct", "Second_Booster_50Plus",
                                                   "Second_Booster_50Plus_Vax_Pct", "Second_Booster_65Plus",
                                                   "Second_Booster_65Plus_Vax_Pct", "Second_Booster_Janssen",
                                                    "Second_Booster_Moderna", "Second_Booster_Pfizer",
                                                    "Second_Booster_Unk_Manuf", "Administered_Bivalent",
                                                   "Bivalent_Booster_5Plus", "Bivalent_Booster_5Plus_Pop_Pct",
                                                   "Bivalent_Booster_12Plus", "Bivalent_Booster_12Plus_Pop_Pct",
                                                   "Bivalent_Booster_18Plus", "Bivalent_Booster_18Plus_Pop_Pct",
                                                "Bivalent_Booster_65Plus", "Bivalent_Booster_65Plus_Pop_Pct"])

In [None]:
# remove commas from numeric columns
# convert numeric columns to correct type
vac_df = vac_df.replace(',','', regex=True)
numeric_cols = ["Distributed", "Administered", "Recip_Administered", "Administered_Dose1_Recip", "Administered_Dose1_Pop_Pct",
                "Administered_Dose1_Recip_5Plus", "Administered_Dose1_Recip_5PlusPop_Pct", "Administered_Dose1_Recip_12Plus",
                "Administered_Dose1_Recip_12PlusPop_Pct", "Administered_Dose1_Recip_18Plus",
                "Administered_Dose1_Recip_18PlusPop_Pct", "Administered_Dose1_Recip_65Plus",
                "Administered_Dose1_Recip_65PlusPop_Pct", "Series_Complete_Yes", "Series_Complete_Pop_Pct",
                "Series_Complete_5Plus", "Series_Complete_5PlusPop_Pct", "Series_Complete_12Plus",
                "Series_Complete_12PlusPop_Pct", "Series_Complete_18Plus",
                "Series_Complete_18PlusPop_Pct", "Series_Complete_65Plus", "Series_Complete_65PlusPop_Pct", "Additional_Doses",
                "Additional_Doses_Vax_Pct", "Additional_Doses_5Plus", "Additional_Doses_5Plus_Vax_Pct", "Additional_Doses_12Plus",
                "Additional_Doses_12Plus_Vax_Pct", "Additional_Doses_18Plus", "Additional_Doses_18Plus_Vax_Pct",
                "Additional_Doses_50Plus", "Additional_Doses_50Plus_Vax_Pct", "Additional_Doses_65Plus",
                "Additional_Doses_65Plus_Vax_Pct", "Second_Booster_50Plus", "Second_Booster_50Plus_Vax_Pct",
                "Second_Booster_65Plus", "Second_Booster_65Plus_Vax_Pct", "Second_Booster_Janssen",
                "Second_Booster_Moderna", "Second_Booster_Pfizer", "Second_Booster_Unk_Manuf", "Administered_Bivalent",
                "Bivalent_Booster_5Plus", "Bivalent_Booster_5Plus_Pop_Pct", "Bivalent_Booster_12Plus",
                "Bivalent_Booster_12Plus_Pop_Pct", "Bivalent_Booster_18Plus", "Bivalent_Booster_18Plus_Pop_Pct",
                "Bivalent_Booster_65Plus", "Bivalent_Booster_65Plus_Pop_Pct"]
vac_df[numeric_cols] = vac_df[numeric_cols].apply(pd.to_numeric)
vac_df

In [None]:
# calculate totals for second booster
vac_df["Second_Booster_Total"] = (vac_df["Second_Booster_Janssen"] + vac_df["Second_Booster_Moderna"]
                                + vac_df["Second_Booster_Pfizer"] + vac_df["Second_Booster_Unk_Manuf"])
# Back-calculate the population figure CDC used from two count/percentage pairs. The two results (Pop1, Pop2) do not
# agree, so the denominator cannot be recovered reliably this way, and the cause has not been pinned down.
# Options: use a population figure from the census, or leave the second booster out.
vac_df["Pop1"] = vac_df["Series_Complete_Yes"] / (vac_df["Series_Complete_Pop_Pct"]/100)
vac_df["Pop2"] = vac_df["Administered_Dose1_Recip"] / (vac_df["Administered_Dose1_Pop_Pct"]/100)

vac_pops_df = pd.DataFrame(data=vac_df, columns=["Location", "Pop1", "Pop2"])
vac_pops_df
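
One possible contributor to the mismatch between `Pop1` and `Pop2` is rounding in the published percentages; this is an unconfirmed guess, not something established from the CDC data. The sketch below uses invented numbers and assumes percentages rounded to one decimal place, to show how far a back-calculated population can drift from the true value.

```python
# All numbers are hypothetical and chosen only to illustrate the arithmetic.
true_pop = 1_000_000
series_complete = 654_321
dose1 = 789_012

# Assume the published percentages are rounded to one decimal place.
pct_complete = round(series_complete / true_pop * 100, 1)
pct_dose1 = round(dose1 / true_pop * 100, 1)

# Back-calculate the population the same way Pop1 and Pop2 are computed.
pop1 = series_complete / (pct_complete / 100)
pop2 = dose1 / (pct_dose1 / 100)

print(f"Pop1 estimate: {pop1:,.0f}")
print(f"Pop2 estimate: {pop2:,.0f}")
print(f"Gap between the two estimates: {abs(pop1 - pop2):,.0f}")
```

If the gaps in the real data are much larger than rounding alone could produce, some other cause (such as different denominators) would need to be investigated.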


In [None]:
# df with vax data for all ages
vac_all_ages_df = pd.DataFrame(data=vac_df, columns=["Location", "Distributed", "Administered", "Recip_Administered",
                                                   "Administered_Dose1_Recip", "Administered_Dose1_Pop_Pct",
                                                    "Series_Complete_Yes", "Series_Complete_Pop_Pct",
                                                     "Additional_Doses", "Additional_Doses_Vax_Pct", "Administered_Bivalent"])

# add 'Dose Differential' column to track doses administered to nonresidents. Negative number = doses leaving the state
vac_all_ages_df["Dose Differential"] = vac_all_ages_df["Administered"] - vac_all_ages_df["Recip_Administered"]
vac_all_ages_df["Dose Diff. as Pct of Doses Given"] = abs(vac_all_ages_df["Dose Differential"] / vac_all_ages_df["Administered"])
vac_all_ages_df["Dose Diff. as Pct of Residents Vaxxed"] = abs(vac_all_ages_df["Dose Differential"] / vac_all_ages_df["Recip_Administered"])


In [None]:
vac_dd_df = pd.DataFrame(data=vac_all_ages_df, columns=["Location", "Distributed", "Administered", "Recip_Administered", "Dose Differential",
                         "Dose Diff. as Pct of Doses Given", "Dose Diff. as Pct of Residents Vaxxed", "Administered_Dose1_Pop_Pct", "Series_Complete_Pop_Pct",
                         "Additional_Doses_Vax_Pct"])
vac_dd_df




In [None]:
# get rows with negative dose differential (states that administered lots of vaccine to people living elsewhere)
# sort in order of large differentials to small (as percent of total doses given)
# for example: in NM at least 3.8% of the doses were given to people who lived elsewhere
vac_dd_neg_df = vac_dd_df[vac_dd_df['Dose Differential'] < 0]
vac_dd_neg_df = vac_dd_neg_df.sort_values(by=['Dose Diff. as Pct of Doses Given'], ascending=False)
vac_dd_neg_df

In [None]:
# get rows with positive dose differential (states with a lot of residents who were vaccinated elsewhere)
# sort in order of large differentials to small (as percent of doses received by residents)
# for example: in Arizona, at least 2.6% of the vaccinated population received doses elsewhere.
vac_dd_pos_df = vac_dd_df[vac_dd_df['Dose Differential'] > 0]
vac_dd_pos_df = vac_dd_pos_df.sort_values(by=['Dose Diff. as Pct of Residents Vaxxed'], ascending=False)
vac_dd_pos_df

In [None]:
# Checking on second booster columns
vac_secondbooster_df = pd.DataFrame(data=vac_df, columns=["Location", "Second_Booster_50Plus",
                                                   "Second_Booster_50Plus_Vax_Pct", "Second_Booster_65Plus",
                                                   "Second_Booster_65Plus_Vax_Pct", "Second_Booster_Janssen",
                                                    "Second_Booster_Moderna", "Second_Booster_Pfizer",
                                                    "Second_Booster_Unk_Manuf"])
vac_secondbooster_df

In [None]:
# find how many doses were distributed vs administered
# calculate percent
# sort alphabetically by state
vac_waste_df = pd.DataFrame(data=vac_df, columns=["Location", "Distributed", "Administered"])
vac_waste_df["Pct. Administered"] = vac_waste_df["Administered"] / vac_waste_df["Distributed"]
vac_waste_df.sort_values('Location')

In [None]:
# Show the 10 states that administered the highest share of the doses distributed to them
vac_waste_best_df = vac_waste_df.sort_values('Pct. Administered', ascending=False)
vac_waste_best_df.head(10)

In [None]:
# Show the 10 states that administered the lowest share of the doses distributed to them
vac_waste_worst_df = vac_waste_df.sort_values('Pct. Administered', ascending=True)
vac_waste_worst_df.head(10)

In [None]:
# whole pop info
# NOTE: There is no percentage for the entire pop for bivalent so included the 5+ pop. If we have time, will pull in the same
# census data they used to get the correct pct
# second booster only has 50+ and 65+ % -- I'm sure there must be some reason for this, not sure what
vac_whole_pop_df = pd.DataFrame(data=vac_df, columns=["Location", "Distributed", "Administered", "Recip_Administered",
                                                   "Administered_Dose1_Recip", "Administered_Dose1_Pop_Pct",
                                                   "Series_Complete_Yes", "Series_Complete_Pop_Pct", 
                                                    "Additional_Doses", "Additional_Doses_Vax_Pct", "Second_Booster_Total", 
                                                    "Second_Booster_Janssen", "Second_Booster_Moderna", "Second_Booster_Pfizer",
                                                    "Second_Booster_Unk_Manuf", "Administered_Bivalent", "Bivalent_Booster_5Plus", 
                                                      "Bivalent_Booster_5Plus_Pop_Pct"])
vac_whole_pop_df

In [None]:
vac_5plus_df = pd.DataFrame(data=vac_df, columns=["Location", 
                                                   "Administered_Dose1_Recip_5Plus", "Administered_Dose1_Recip_5PlusPop_Pct",
                                                  "Series_Complete_5Plus", "Series_Complete_5PlusPop_Pct",
                                                   "Additional_Doses_5Plus", "Additional_Doses_5Plus_Vax_Pct", 
                                                   "Bivalent_Booster_5Plus", "Bivalent_Booster_5Plus_Pop_Pct"])
vac_5plus_df

In [None]:
# 12 plus vaccinations
vac_12plus_df = pd.DataFrame(data=vac_df, columns=["Location", "Administered_Dose1_Recip_12Plus",
                                                   "Administered_Dose1_Recip_12PlusPop_Pct", "Series_Complete_12Plus",
                                                   "Series_Complete_12PlusPop_Pct", "Additional_Doses_12Plus",
                                                   "Additional_Doses_12Plus_Vax_Pct", "Bivalent_Booster_12Plus",
                                                   "Bivalent_Booster_12Plus_Pop_Pct"])
vac_12plus_df

In [None]:
# 18 plus vaccinations
vac_18plus_df = pd.DataFrame(data=vac_df, columns=["Location", "Administered_Dose1_Recip_18Plus",
                                                   "Administered_Dose1_Recip_18PlusPop_Pct", "Series_Complete_18Plus",
                                                   "Series_Complete_18PlusPop_Pct", "Additional_Doses_18Plus",
                                                   "Additional_Doses_18Plus_Vax_Pct", "Bivalent_Booster_18Plus",
                                                   "Bivalent_Booster_18Plus_Pop_Pct"])
vac_18plus_df

In [None]:
# 65 plus vaccinations
vac_65plus_df = pd.DataFrame(data=vac_df, columns=["Location", "Administered_Dose1_Recip_65Plus",
                                                   "Administered_Dose1_Recip_65PlusPop_Pct", "Series_Complete_65Plus",
                                                   "Series_Complete_65PlusPop_Pct", "Additional_Doses_65Plus",
                                                   "Additional_Doses_65Plus_Vax_Pct", "Second_Booster_50Plus",
                                                   "Second_Booster_50Plus_Vax_Pct", "Second_Booster_65Plus",
                                                   "Second_Booster_65Plus_Vax_Pct", "Bivalent_Booster_65Plus",
                                                   "Bivalent_Booster_65Plus_Pop_Pct"])
vac_65plus_df

In [None]:
# first dose info

vac_firstdose_df = pd.DataFrame(data=vac_df, columns=["Location", "Administered_Dose1_Recip", "Administered_Dose1_Pop_Pct",
                                                   "Administered_Dose1_Recip_5Plus", "Administered_Dose1_Recip_5PlusPop_Pct",
                                                  "Administered_Dose1_Recip_12Plus", "Administered_Dose1_Recip_12PlusPop_Pct",
                                                  "Administered_Dose1_Recip_18Plus", "Administered_Dose1_Recip_18PlusPop_Pct",
                                                  "Administered_Dose1_Recip_65Plus", "Administered_Dose1_Recip_65PlusPop_Pct"])
vac_firstdose_df

In [None]:
# series complete info
vac_series_complete_df = pd.DataFrame(data=vac_df, columns=["Location", "Series_Complete_Yes", "Series_Complete_Pop_Pct",
                                                            "Series_Complete_5Plus", "Series_Complete_5PlusPop_Pct",
                                                            "Series_Complete_12Plus", "Series_Complete_12PlusPop_Pct",
                                                            "Series_Complete_18Plus", "Series_Complete_18PlusPop_Pct",
                                                            "Series_Complete_65Plus", "Series_Complete_65PlusPop_Pct"])
vac_series_complete_df

In [None]:
vac_additional_doses_df = pd.DataFrame(data=vac_df, columns=["Location", "Additional_Doses",
                                                   "Additional_Doses_Vax_Pct", "Additional_Doses_5Plus",
                                                   "Additional_Doses_5Plus_Vax_Pct", "Additional_Doses_12Plus",
                                                   "Additional_Doses_12Plus_Vax_Pct", "Additional_Doses_18Plus",
                                                   "Additional_Doses_18Plus_Vax_Pct", "Additional_Doses_50Plus",
                                                   "Additional_Doses_50Plus_Vax_Pct", "Additional_Doses_65Plus",
                                                   "Additional_Doses_65Plus_Vax_Pct"])
vac_additional_doses_df

In [None]:
vac_second_booster_df = pd.DataFrame(data=vac_df, columns=["Location", "Second_Booster_50Plus",
                                                           "Second_Booster_50Plus_Vax_Pct", "Second_Booster_65Plus",
                                                           "Second_Booster_65Plus_Vax_Pct", "Second_Booster_Total",
                                                           "Second_Booster_Janssen", "Second_Booster_Moderna",
                                                           "Second_Booster_Pfizer", "Second_Booster_Unk_Manuf"])
vac_second_booster_df

In [None]:
vac_bivalent_df = pd.DataFrame(data=vac_df, columns=["Location", "Administered_Bivalent", "Bivalent_Booster_5Plus",
                                                     "Bivalent_Booster_5Plus_Pop_Pct", "Bivalent_Booster_12Plus",
                                                     "Bivalent_Booster_12Plus_Pop_Pct", "Bivalent_Booster_18Plus",
                                                     "Bivalent_Booster_18Plus_Pop_Pct", "Bivalent_Booster_65Plus",
                                                     "Bivalent_Booster_65Plus_Pop_Pct"])
vac_bivalent_df

# Kendal Work Area

In [None]:
#putting Greg's code down here so I can run my area independently of the rest of the sheet without error
#Import libraries
import pandas as pd
import matplotlib.pyplot as plt 
import plotly.express as px
import plotly.graph_objects as go

In [None]:
#Import vaccination data from csv
vac_df = pd.read_csv('Resources/COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv')

In [None]:
#function formats the CDC dataframe for US jurisdictions - see below for input formats
def CDC_format(df,key,date,add_str,drop_values): #key, date, add_str -> str; drop_values -> list
    df = df.dropna(subset=[key])
    df = df[df['Date'] == date] #keep only rows for the requested date (uses df, not the global vac_df)
    df = df[~df[key].isin(drop_values)].copy() #drop unwanted jurisdictions
    df[key] = add_str + df[key].astype(str) #prefix each location code, e.g. 'AK' -> 'US_AK'
    df.reset_index(drop = True, inplace = True)
    return df

In [None]:
#drop non-state territories from dataframe
drop = ['DD2','FM','AS','VI','BP2','IH2','GU','PN','PR','VA2','PW','US','MP','MH']
vac_df = CDC_format(vac_df,'Location','12/28/2022','US_',drop)

In [None]:
#change location to match state code for choropleth maps
vac_df['Location'] = vac_df['Location'].str.replace('US_', '')

In [None]:
#create df with only columns related to choropleth maps
choropleth_vac_df = vac_df[['Location', 
                            'Distributed', 
                            'Administered', 
                            'Administered_Dose1_Pop_Pct', 
                            'Series_Complete_Pop_Pct', 
                            'Series_Complete_5PlusPop_Pct', 
                            'Series_Complete_12PlusPop_Pct', 
                            'Series_Complete_18PlusPop_Pct', 
                            'Series_Complete_65PlusPop_Pct', 
                            'Additional_Doses_Vax_Pct', 
                            'Additional_Doses_65Plus_Vax_Pct', 
                            'Second_Booster_65Plus_Vax_Pct', 
                            'Bivalent_Booster_5Plus_Pop_Pct', 
                            'Bivalent_Booster_12Plus_Pop_Pct', 
                            'Bivalent_Booster_18Plus_Pop_Pct', 
                            'Bivalent_Booster_65Plus_Pop_Pct']]

In [None]:
#determining color range to use for the continuous color scale in choropleth maps for bivalent booster status
choropleth_vac_bivalent = choropleth_vac_df[['Bivalent_Booster_5Plus_Pop_Pct', 
                                             'Bivalent_Booster_12Plus_Pop_Pct', 
                                             'Bivalent_Booster_18Plus_Pop_Pct', 
                                             'Bivalent_Booster_65Plus_Pop_Pct']]
min_bivalent_df = choropleth_vac_bivalent.min()
max_bivalent_df = choropleth_vac_bivalent.max()
min_color_range_bivalent = min_bivalent_df.min()
max_color_range_bivalent = max_bivalent_df.max()
print(f"The color range for any bivalent vaccination status choropleth map should be ({min_color_range_bivalent}, {max_color_range_bivalent}).")



In [None]:
#determining color range to use for the continuous color scale in choropleth maps for fully vaccinated, and partially or fully vaccinated
choropleth_vac_complete = choropleth_vac_df[['Administered_Dose1_Pop_Pct',
                                             'Series_Complete_Pop_Pct',
                                             'Series_Complete_5PlusPop_Pct', 
                                             'Series_Complete_12PlusPop_Pct', 
                                             'Series_Complete_18PlusPop_Pct', 
                                             'Series_Complete_65PlusPop_Pct']]
min_df = choropleth_vac_complete.min()
max_df = choropleth_vac_complete.max()
min_color_range = min_df.min()
max_color_range = max_df.max()
print(f"The color range for any complete vaccination choropleth map should be ({min_color_range}, {max_color_range}).")
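
The maps in this area type the color range in by hand. As a sketch of one alternative, the bounds could be computed once, widened to whole numbers, and then passed to the `range_color` argument of `px.choropleth`. The values below are made up for illustration.

```python
import math
import pandas as pd

# Made-up percentages (illustration only).
toy = pd.DataFrame({
    "Series_Complete_Pop_Pct": [52.4, 70.1, 81.0],
    "Series_Complete_65PlusPop_Pct": [80.2, 90.5, 95.0],
})

# Smallest and largest value across every column, widened to whole numbers
# so that all maps built from these columns share one color scale.
low = math.floor(toy.min().min())
high = math.ceil(toy.max().max())
range_color = (low, high)
print(range_color)
```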

In [None]:
fig_complete_total_pop = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Series_Complete_Pop_Pct',
                    labels={'Series_Complete_Pop_Pct':'% of Population Fully Vaccinated'},
                    color_continuous_scale="viridis_r",
                    range_color=(52,95),
                    title='Vaccination Status by State - Fully Vaccinated'
                    )
fig_complete_total_pop

In [None]:
fig_complete_5plus = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Series_Complete_5PlusPop_Pct',
                    range_color=(52,95),
                    labels={'Series_Complete_5PlusPop_Pct':'% of 5+ Population Fully Vaccinated'},
                    color_continuous_scale="viridis_r",
                    title='Vaccination Status by State - Fully Vaccinated (5+)'
                    )
fig_complete_5plus

In [None]:
fig_complete_12plus = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Series_Complete_12PlusPop_Pct',
                    labels={'Series_Complete_12PlusPop_Pct':'% of 12+ Population Fully Vaccinated'},
                    color_continuous_scale="viridis_r",
                    range_color=(52,95),
                    title='Vaccination Status by State - Fully Vaccinated (12+)'
                    )
fig_complete_12plus

In [None]:
fig_complete_18plus = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Series_Complete_18PlusPop_Pct',
                    labels={'Series_Complete_18PlusPop_Pct':'% of 18+ Population Fully Vaccinated'},
                    color_continuous_scale="viridis_r",
                    range_color=(52,95),
                    title='Vaccination Status by State - Fully Vaccinated (18+)'
                    )
fig_complete_18plus

In [None]:
fig_at_least_1 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Administered_Dose1_Pop_Pct',
                    labels={'Administered_Dose1_Pop_Pct':'% of Population Partially or Fully Vaccinated'},
                    color_continuous_scale="viridis_r",
                    range_color=(52,95),
                    title='Vaccination Status by State - Partially or Fully Vaccinated'
                    )
fig_at_least_1

In [None]:
fig_complete_65_plus = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Series_Complete_65PlusPop_Pct',
                    range_color=(52,95),
                    labels={'Series_Complete_65PlusPop_Pct':'% of 65+ Population Fully Vaccinated'},
                    color_continuous_scale="viridis_r", 
                    title='Vaccination Status by State - Fully Vaccinated (65+)'
                    )
fig_complete_65_plus

In [None]:
fig_bivalent_booster_65 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Bivalent_Booster_65Plus_Pop_Pct',
                    labels={'Bivalent_Booster_65Plus_Pop_Pct':'% of 65+ Population with Bivalent Booster'},
                    color_continuous_scale="magma_r",  
                    range_color=(5,63),
                    title='Bivalent Booster Status by State - (65+)'
                    )
fig_bivalent_booster_65

In [None]:
fig_bivalent_booster_5 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Bivalent_Booster_5Plus_Pop_Pct',
                    labels={'Bivalent_Booster_5Plus_Pop_Pct':'% of 5+ Population with Bivalent Booster'},
                    color_continuous_scale="magma_r",
                    range_color=(5,63),
                    title='Bivalent Booster Status by State - (5+)'
                    )
fig_bivalent_booster_5

In [None]:
fig_bivalent_booster_12 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Bivalent_Booster_12Plus_Pop_Pct',
                    labels={'Bivalent_Booster_12Plus_Pop_Pct':'% of 12+ Population with Bivalent Booster'},
                    color_continuous_scale="magma_r",  
                    range_color=(5,63),
                    title='Bivalent Booster Status by State - (12+)'
                    )
fig_bivalent_booster_12

In [None]:
fig_bivalent_booster_18 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Bivalent_Booster_18Plus_Pop_Pct',
                    labels={'Bivalent_Booster_18Plus_Pop_Pct':'% of 18+ Population with Bivalent Booster'},
                    color_continuous_scale="magma_r",  
                    range_color=(5,63),
                    title='Bivalent Booster Status by State - (18+)'
                    )
fig_bivalent_booster_18

In [None]:
#boxplots showing spread of data across all 50 states and DC for selected columns
boxplot = vac_df.boxplot(column=['Series_Complete_Pop_Pct', 
                                 'Administered_Dose1_Pop_Pct', 
                                 'Series_Complete_65PlusPop_Pct', 
                                 'Bivalent_Booster_65Plus_Pop_Pct'], 
                         grid=True,
                         figsize = (20,15),
                        )
plt.title("Distribution of Vaccination Rates Across U.S. States")
plt.xticks([1, 2, 3, 4], ['% Pop. Fully Vaccinated', '% Pop. Partially or Fully Vaccinated', '% Pop. Fully Vaccinated - 65+', '% Pop. Bivalent Booster - 65+'])
plt.savefig('Resources/boxplot.png')

In [None]:
#finding percentage of all projects that fall in each of the 7 expenditure category groups
spending_overall = all_us_projects_df.groupby('Expenditure Category Group').count()
total_projects = spending_overall['State'].sum()
categories_percentage_overall = (spending_overall['State']/total_projects)*100
categories_percentage_overall = pd.DataFrame(categories_percentage_overall)
categories_percentage_overall.rename(columns={'State': '% of Total Projects'}, inplace=True)
categories_percentage_overall = categories_percentage_overall.reset_index()

In [None]:
#finding percentage of covid-related projects that fall in each of the 7 expenditure category groups
spending_covid_project = covid_projects_df.groupby("Expenditure Category Group").count()
total_projects_covid = spending_covid_project['Recipient Name'].sum()
categories_percentage_covid = (spending_covid_project['Recipient Name']/total_projects_covid)*100
categories_percentage_covid = pd.DataFrame(categories_percentage_covid)
categories_percentage_covid.rename(columns={'Recipient Name':'% of Covid Projects'}, inplace=True)
categories_percentage_covid.reset_index()

In [None]:
#finding the percentage of all projects that fall in each of the 7 expenditure category groups, grouped by state
spending_by_state_all = all_us_projects_df.groupby(['State', 'Expenditure Category Group']).count()
total_projects_per_state_all = spending_by_state_all.groupby('State')['Recipient Name'].sum()
categories_percentage_by_state_all = (spending_by_state_all['Recipient Name']/total_projects_per_state_all)*100
categories_percentage_by_state_all = pd.DataFrame(categories_percentage_by_state_all)
categories_percentage_by_state_all.rename(columns={'Recipient Name': '% of All Projects'}, inplace=True)
categories_percentage_by_state_all

In [None]:
#finding the percentage of covid-related projects that fall in each of the 7 expenditure category groups, grouped by state
spending_by_state_covid = covid_projects_df.groupby(['State', 'Expenditure Category Group']).count()
total_projects_per_state_covid = spending_by_state_covid.groupby('State')['Recipient Name'].sum()
categories_percentage_by_state_covid = (spending_by_state_covid['Recipient Name']/total_projects_per_state_covid)*100
categories_percentage_by_state_covid = pd.DataFrame(categories_percentage_by_state_covid)
categories_percentage_by_state_covid.rename(columns={'Recipient Name': '% of Covid Projects'}, inplace=True)
categories_percentage_by_state_covid
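
The per-state percentages above can also be sketched with `groupby(...).size()` and `transform('sum')`, which avoids picking an arbitrary column such as `Recipient Name` to count. The state codes and group names below are placeholders, not values from the EARN workbook.

```python
import pandas as pd

# Placeholder projects (illustration only).
toy = pd.DataFrame({
    "State": ["AA", "AA", "AA", "BB"],
    "Expenditure Category Group": ["Group A", "Group A", "Group B", "Group A"],
})

# Number of projects per (state, category group) pair.
counts = toy.groupby(["State", "Expenditure Category Group"]).size()

# Total projects per state, repeated on every row belonging to that state.
state_totals = counts.groupby(level="State").transform("sum")

pct = counts / state_totals * 100
print(pct.rename("% of Projects").reset_index())
```

Within each state the percentages sum to 100, which is a quick sanity check on the result.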
