# A Comparison of State Use of SLFRF Funds for Vaccination Programs and Vaccination Rates in Each State



### Data Sources:
CDC - "COVID-19 Vaccinations in the United States, Jurisdiction"
csv downloaded 5/11/23
https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-Jurisdi/unsk-b7fc

EARN/EPI - "EARN SLFRF Workbook for Q4 2022" compiled by Dave Kamper of the Economic Policy Institute (dkamper@epi.org) from Treasury reports by states and local jurisdictions that received funding, and other data sources as detailed in the workbook.

## Production Code (Team: Put your code here after it is complete and ready to go)

## Evan Work Area

In [1]:
# import dependencies and setup
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pprint import pprint
from pathlib import Path

In [2]:
# Load csv file(s)
all_states_sheet = Path("Resources/EARN_all_states.csv")


# Read csv file(s) as a DataFrame
all_states_df = pd.read_csv(all_states_sheet, skipinitialspace= True)


# preview the raw DataFrame
print(len(all_states_df['Project ID']))
all_states_df.head()


all_states_df.columns = all_states_df.columns.str.strip()

print(all_states_df.columns)

35710
Index(['Project ID', 'Recipient-ID', 'Recipient Name', 'State/Territory',
       'StateList', 'Reporting Tier', 'Recipient Type', 'Completion Status',
       'Project Name', 'Expenditure Category Group', 'Expenditure Category',
       'Project Description', 'Adopted Budget', 'Total Cumulative Obligations',
       'Total Cumulative Expenditures',
       'Community benefit agreement? (Infrastructure Only)',
       'Complying with David Bacon? (Infrastructure Only)',
       'Project labor agreement? (Infrastructure Only)',
       'Primary Demographic Served (Select Expenditure Categories Only)'],
      dtype='object')




In [3]:
# Review list of NA values in the 'Project Description' column
nan_values = all_states_df[all_states_df['Project Description'].isna()]

# print(len(nan_values))
print(f'There are {len(nan_values)} rows with NA values in the "Project Description" column:')

#nan_values

There are 4 rows with NA values in the "Project Description" column:


In [4]:
# Drop the rows where 'Project Description' has a NaN value
    # source: https://towardsdatascience.com/how-to-drop-rows-in-pandas-dataframes-with-nan-values-in-certain-columns-7613ad1a7f25
    
all_states_df = all_states_df.dropna(subset=['Project Description'], how='all')

# confirm 4 rows were dropped by reviewing the row count:

print(f'The DataFrame now has {len(all_states_df["Project ID"])} rows of data.')
all_states_df.head(1)


In [5]:
# Make the Project Description values all lowercase for value search:
all_states_df['Project Description'] = all_states_df['Project Description'].str.lower()

print('The Project Description column has been set to lowercase for all string values:')
all_states_df.head(2)

The Project Description column has been set to lowercase for all string values:


Unnamed: 0,Project ID,Recipient-ID,Recipient Name,State/Territory,StateList,Reporting Tier,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Community benefit agreement? (Infrastructure Only),Complying with David Bacon? (Infrastructure Only),Project labor agreement? (Infrastructure Only),Primary Demographic Served (Select Expenditure Categories Only)
0,TPN-039343,RCP-039196,"Woodbury County, Iowa",Iowa,Iowa,Tier 2. Metropolitan cities and counties with...,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,-,-,-,,,,1 Imp General Public
1,TPN-039461,RCP-036070,"Lexington-Fayette Urban County, Kentucky",Kentucky,Kentucky,"Tier 1. States, U.S. territories, metropolitan...",Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,-,-,-,,,,


In [6]:
# Brainstorm a list of words to filter the 'Project Description' column by.
    ## this list will be used to filter that column so that we are only working with projects that
    ## are actually covid related.
    
# NOTE: str.contains is case-sensitive by default. 'Project Description' was lowercased above,
# so the search terms below must be lowercase too. eg) 'moderna', not 'Moderna'.
search_term_list = ['covid', 'covid-19', 'vaccine', 'vaccination', 'vaccinated', 'moderna', 'pfizer', 'johnson & johnson', 'janssen']
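
`str.contains` is case-sensitive by default, so lowercase search terms only match here because the column was lowercased first. As a minimal sketch on made-up descriptions, the same search can be made case-insensitive with `case=False`. The names `sample` and `pattern` are illustrative, and `re.escape` is an added precaution in case a search term ever contains a regex metacharacter.

```python
import re
import pandas as pd

# made-up descriptions, for illustration only
sample = pd.Series(["Moderna booster clinic", "road repair", "COVID-19 testing site"])

search_term_list = ['covid', 'vaccine', 'moderna']

# escape each term so characters like '.' or '+' are matched literally
pattern = '|'.join(re.escape(term) for term in search_term_list)

# case=False makes the match case-insensitive; na=False treats missing values as "no match"
mask = sample.str.contains(pattern, case=False, na=False)
print(mask.tolist())
```

With `na=False`, rows with a missing description are simply excluded from the match instead of raising an error during boolean indexing.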



In [7]:
# Filter the dataframe column 'Project Description'
    ## source: https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p

    
covid_projects_df = all_states_df[all_states_df['Project Description'].str.contains('|'.join(search_term_list))]


# print(len(all_states_df['Project Description']))
print(f'The number of rows containing covid/vaccine search criteria terms is {len(covid_projects_df["Project ID"])}')
covid_projects_df.head(4)


The number of rows containing covid/vaccine search criteria terms is 8225


Unnamed: 0,Project ID,Recipient-ID,Recipient Name,State/Territory,StateList,Reporting Tier,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Community benefit agreement? (Infrastructure Only),Complying with David Bacon? (Infrastructure Only),Project labor agreement? (Infrastructure Only),Primary Demographic Served (Select Expenditure Categories Only)
0,TPN-039343,RCP-039196,"Woodbury County, Iowa",Iowa,Iowa,Tier 2. Metropolitan cities and counties with...,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,-,-,-,,,,1 Imp General Public
1,TPN-039461,RCP-036070,"Lexington-Fayette Urban County, Kentucky",Kentucky,Kentucky,"Tier 1. States, U.S. territories, metropolitan...",Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,-,-,-,,,,
5,TPN-055785,RCP-035970,State Of Idaho,Idaho,Idaho,"Tier 1. States, U.S. territories, metropolitan...",State/DC,Cancelled,Reserve for Covid 19 costs,1-Public Health,1.14-Other Public Health Services,additional unanticipated covid medical costs,-,-,-,,,,1 Imp General Public
10,TPN-056253,RCP-035970,State Of Idaho,Idaho,Idaho,"Tier 1. States, U.S. territories, metropolitan...",State/DC,Cancelled,DHW Home visiting,2-Negative Economic Impacts,2.12-Healthy Childhood Environments: Home Visi...,•\tthe idaho department of health and welfare ...,-,-,-,,,,14 Dis Imp Low income HHs and populations


In [8]:
# Now convert all budget-related columns to a numeric type for summing in the .groupby step:
# note that pandas imported these csv columns as object (string) dtype rather than a numeric dtype:

# print(all_states_df.dtypes)
print('\n----------------------------\n')
print(covid_projects_df.dtypes)


----------------------------

Project ID                                                         object
Recipient-ID                                                       object
Recipient Name                                                     object
State/Territory                                                    object
StateList                                                          object
Reporting Tier                                                     object
Recipient Type                                                     object
Completion Status                                                  object
Project Name                                                       object
Expenditure Category Group                                         object
Expenditure Category                                               object
Project Description                                                object
Adopted Budget                                                     object
Total C

In [9]:
# clean up values preventing change of data type to numeric
numeric_cols = ['Adopted Budget', 'Total Cumulative Obligations', 'Total Cumulative Expenditures']

# work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
covid_projects_df = covid_projects_df.copy()

# remove the '-' placeholders and stray spaces from the budget columns
covid_projects_df[numeric_cols] = covid_projects_df[numeric_cols].replace(['-', ' '], '', regex=True)

# remove thousands separators
# note: this strips commas from every column, including text columns such as 'Recipient Name'
covid_projects_df = covid_projects_df.replace(',', '', regex=True)

# convert budget columns to numeric (float) for summarizing in groupby:
covid_projects_df[numeric_cols] = covid_projects_df[numeric_cols].apply(pd.to_numeric)


print(covid_projects_df['Adopted Budget'].unique())

# print(covid_projects_df.dtypes)

covid_projects_df.head(3)


[       nan 1000000.     28300.   ...  205796.55 1705540.      8265.39]



Unnamed: 0,Project ID,Recipient-ID,Recipient Name,State/Territory,StateList,Reporting Tier,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Community benefit agreement? (Infrastructure Only),Complying with David Bacon? (Infrastructure Only),Project labor agreement? (Infrastructure Only),Primary Demographic Served (Select Expenditure Categories Only)
0,TPN-039343,RCP-039196,Woodbury County Iowa,Iowa,Iowa,Tier 2. Metropolitan cities and counties with...,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,,,,,,,1 Imp General Public
1,TPN-039461,RCP-036070,Lexington-Fayette Urban County Kentucky,Kentucky,Kentucky,Tier 1. States U.S. territories metropolitan c...,Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,,,,,,,
5,TPN-055785,RCP-035970,State Of Idaho,Idaho,Idaho,Tier 1. States U.S. territories metropolitan c...,State/DC,Cancelled,Reserve for Covid 19 costs,1-Public Health,1.14-Other Public Health Services,additional unanticipated covid medical costs,,,,,,,1 Imp General Public
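
The placeholder handling in the cell above can be checked on a few made-up values. This sketch assumes a lone `-` marks a missing amount, and it uses `errors='coerce'` so anything unparseable becomes NaN rather than raising. Unlike the cell above, it keeps a leading minus sign, which would matter only if the data ever contains negative amounts. `raw` and `cleaned` are illustrative names.

```python
import pandas as pd

# made-up budget strings, for illustration only
raw = pd.Series(["1,000,000", "-", " 28,300 ", "-500", "205,796.55"])

cleaned = (
    raw.str.strip()                        # drop surrounding whitespace
       .replace("-", "")                   # a lone '-' is treated as "no value"
       .str.replace(",", "", regex=False)  # remove thousands separators
)

# empty or unparseable strings become NaN instead of raising an error
cleaned = pd.to_numeric(cleaned, errors="coerce")
print(cleaned.tolist())
```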


In [10]:
# Group the filtered dataframe by state, summing the applicable $ value columns
    ## the budget columns must already be numeric for this step.
    ## eg) 'Adopted Budget' originally had values containing "-", which would prevent .sum() from working

# select the budget columns (numeric_cols, defined in the previous cell) before calling .sum()
state_spending_df = covid_projects_df.groupby('State/Territory', as_index=False)[numeric_cols].sum()

print(f'The column headers for the state_spending_df are:\n\n {state_spending_df.columns}')
state_spending_df.head()

The column headers for the state_spending_df are:

 Index(['State/Territory', 'Adopted Budget', 'Total Cumulative Obligations',
       'Total Cumulative Expenditures'],
      dtype='object')


Unnamed: 0,State/Territory,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
0,Alabama,341856900.0,239994500.0,192230300.0
1,Alaska,89922310.0,50857970.0,47895680.0
2,American Samoa,447866300.0,32768640.0,30226340.0
3,Arizona,1530397000.0,991496100.0,693236200.0
4,Arkansas,161146800.0,159726000.0,144292500.0
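
If a per-state project count is also wanted (the to-do list further down mentions comparing how many projects each state ran), named aggregation can produce the count and the sum in one pass. This is a sketch on made-up rows; `projects`, `project_count`, and `total_expenditures` are illustrative names, not columns from the workbook.

```python
import pandas as pd

# made-up project rows, for illustration only
projects = pd.DataFrame({
    "State/Territory": ["Alabama", "Alabama", "Alaska"],
    "Project ID": ["P-1", "P-2", "P-3"],
    "Total Cumulative Expenditures": [100.0, 50.0, 25.0],
})

# named aggregation: output_column=(input_column, aggregation)
summary = projects.groupby("State/Territory", as_index=False).agg(
    project_count=("Project ID", "count"),
    total_expenditures=("Total Cumulative Expenditures", "sum"),
)
print(summary)
```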


In [11]:
# Add column of state name abbreviations:
# source: https://gist.github.com/rogerallen/1583593

us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
}
    
# add abbreviated state name column and reorder so the abbrev is after full state name column:
state_spending_df['Location'] = state_spending_df['State/Territory'].map(us_state_to_abbrev)
state_spending_df = state_spending_df[['State/Territory', 'Location', 'Adopted Budget', 
                                       'Total Cumulative Obligations', 'Total Cumulative Expenditures']]

state_spending_df.head()


Unnamed: 0,State/Territory,Location,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
0,Alabama,AL,341856900.0,239994500.0,192230300.0
1,Alaska,AK,89922310.0,50857970.0,47895680.0
2,American Samoa,AS,447866300.0,32768640.0,30226340.0
3,Arizona,AZ,1530397000.0,991496100.0,693236200.0
4,Arkansas,AR,161146800.0,159726000.0,144292500.0
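
`Series.map` returns NaN for any name that is missing from the dictionary, so it may be worth listing unmapped names before relying on the `Location` column. A minimal sketch with a shortened mapping; the `abbrev` dict and the "Virgin Islands" spelling are hypothetical stand-ins, not values confirmed to be in the data.

```python
import pandas as pd

# shortened, illustrative mapping; the notebook uses the full us_state_to_abbrev dict
abbrev = {"Alabama": "AL", "Alaska": "AK"}

# the last spelling is a hypothetical variant that is not a key in the dict
names = pd.Series(["Alabama", "Alaska", "Virgin Islands"])
mapped = names.map(abbrev)

# names missing from the dict map to NaN; list them so they can be added or corrected
unmapped = names[mapped.isna()].unique().tolist()
print(unmapped)
```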


In [12]:
# "all_us_projects_df" is for (2) from Joanna's slack message request:
all_us_projects_df = all_states_df[['Recipient Name', 'State/Territory', 'Recipient Type', 
                                    'Completion Status', 'Project Name', 'Expenditure Category Group', 'Expenditure Category', 
                                    'Project Description', 'Adopted Budget', 'Total Cumulative Obligations', 
                                    'Total Cumulative Expenditures']].copy()


all_us_projects_df['State/Territory'] = all_us_projects_df['State/Territory'].map(us_state_to_abbrev)
all_us_projects_df.rename(columns = {'State/Territory':'State'}, inplace = True)

numeric_cols = ['Adopted Budget', 'Total Cumulative Obligations', 'Total Cumulative Expenditures']

# remove the '-' placeholders and stray spaces from the budget columns
all_us_projects_df[numeric_cols] = all_us_projects_df[numeric_cols].replace(['-', ' '], '', regex=True)


# remove thousands separators (this also strips commas from the text columns),
# then convert budget columns to numeric (float) for summarizing in groupby:
all_us_projects_df = all_us_projects_df.replace(',', '', regex=True)
all_us_projects_df[numeric_cols] = all_us_projects_df[numeric_cols].apply(pd.to_numeric)

# all_us_projects_df.dtypes
all_us_projects_df.head(3)

Unnamed: 0,Recipient Name,State,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
0,Woodbury County Iowa,IA,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,,,
1,Lexington-Fayette Urban County Kentucky,KY,Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,,,
2,Lexington-Fayette Urban County Kentucky,KY,Local Government,Cancelled,Non-Profit Capital Grants,6-Revenue Replacement,6.1-Provision of Government Services,the nonprofit capital project grants program i...,,,


In [13]:
# "us_covid_projects_df" is for (3) from Joanna's slack message:
us_covid_projects_df = all_us_projects_df[all_us_projects_df['Project Description'].str.contains('|'.join(search_term_list))]


# print(len(all_states_df['Project Description']))
print(f'The number of rows containing covid/vaccine search criteria terms is {len(us_covid_projects_df["Project Name"])}')
us_covid_projects_df.head()

The number of rows containing covid/vaccine search criteria terms is 8225


Unnamed: 0,Recipient Name,State,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
0,Woodbury County Iowa,IA,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,,,
1,Lexington-Fayette Urban County Kentucky,KY,Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,,,
5,State Of Idaho,ID,State/DC,Cancelled,Reserve for Covid 19 costs,1-Public Health,1.14-Other Public Health Services,additional unanticipated covid medical costs,,,
10,State Of Idaho,ID,State/DC,Cancelled,DHW Home visiting,2-Negative Economic Impacts,2.12-Healthy Childhood Environments: Home Visi...,•\tthe idaho department of health and welfare ...,,,
13,State Of Idaho,ID,State/DC,Cancelled,EMS Ambulance capacity,1-Public Health,1.10-COVID-19 Aid to Impacted Industries,•\tthe idaho legislature appropriated $2500000...,,,


In [None]:
#TODO: Collect and clean data for (1) from Joanna's slack message request



## Sarah Work Area

In [None]:
# import and read the state_summary.csv
# Load csv file(s)
state_summary_sheet = Path("Resources/state_summary.csv")


# Read csv file(s) as a DataFrame
state_summary_df = pd.read_csv(state_summary_sheet, skipinitialspace= True)

print(f'The data types for this dataframe are already formatted as floats (nice!)\n\n{state_summary_df.dtypes}')
state_summary_df.head()

In [3]:
# create a reduced dataframe from the state_summary_df columns: 
    #'State', 'Total state allocation (from the fed)', 'total state plus total local federal grant', 
    #'Share of state allocation spent', 'Share of state allocation obligated', 'Share of state allocation budgeted', 
    #'Total local allocation (from the fed)', 'Share of local spent', 'Share of local obligated', 'Share of local budgeted', 
    #'Share of state + local spent'

import pandas as pd
from pathlib import Path
# Load csv file(s)
state_summary_sheet = Path("Resources/state_summary.csv")

# Read csv file(s) as a DataFrame
state_summary_df = pd.read_csv(state_summary_sheet, skipinitialspace=True)

# Selecting the desired columns
reduced_df = state_summary_df[['State', 'Total state allocation (from the fed)',
                               'total state plus total local federal grant',
                               'Share of state allocation spent', 'Share of state allocation obligated',
                               'Share of state allocation budgeted', 'Total local allocation (from the fed)',
                               'Share of local spent', 'Share of local obligated', 'Share of local budgeted',
                               'Share of state + local spent']]

# Printing the reduced dataframe
print(reduced_df.head())




FileNotFoundError: [Errno 2] No such file or directory: 'Resources/state_summary.csv'
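
The `FileNotFoundError` above typically means the relative path was resolved against a different working directory than expected, or the file has not been added to `Resources/` yet. A small diagnostic sketch, where `describe_path` is a hypothetical helper and not part of the notebook:

```python
from pathlib import Path

def describe_path(relative_path):
    """Return the absolute location a relative path resolves to, and whether it exists."""
    path = Path(relative_path)
    return path.resolve(), path.exists()

resolved, exists = describe_path("Resources/state_summary.csv")
print(f"Working directory: {Path.cwd()}")
print(f"Looking for: {resolved} (exists: {exists})")
```

Running this in the same cell as the failing `read_csv` shows where the notebook is actually looking for the file.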

In [None]:
# merge this data frame with Evan's "state_spending_df". Merge on the state columns.
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
    # https://www.geeksforgeeks.org/how-to-join-pandas-dataframes-using-merge/#
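
One way the merge could look, sketched on made-up stand-ins for the two dataframes. It assumes the `State` column in the summary data holds full state names that match `State/Territory`, which has not been confirmed here. `indicator=True` adds a `_merge` column that flags rows found in only one table.

```python
import pandas as pd

# made-up stand-ins for state_spending_df and the reduced summary dataframe
spending = pd.DataFrame({
    "State/Territory": ["Alabama", "Alaska", "Guam"],
    "Total Cumulative Expenditures": [100.0, 50.0, 5.0],
})
summary = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Share of state allocation spent": [0.4, 0.3, 0.5],
})

# an outer merge keeps unmatched rows from both sides so mismatches are visible
merged = spending.merge(summary, how="outer", left_on="State/Territory",
                        right_on="State", indicator=True)

# rows that are not 'both' exist in only one of the two tables
print(merged["_merge"].value_counts())
```

Once the unmatched rows are understood, switching to `how="inner"` would keep only states present in both tables.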



## Aaliyah Work Area

In [None]:
# Using Sarah's combined dataframe, generate a combined bar/line chart
# x-axis will contain state names
# left-side y-axis and bar chart data will show % state funding used.
# right-side y-axis and line chart data will show 'total state plus total local federal grant' dollar amounts

    #source methods: https://towardsdatascience.com/creating-a-dual-axis-combo-chart-in-python-
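
A possible starting point for the chart described above, using matplotlib's `twinx()` for the second y-axis. The values are made up, and the sketch assumes the share column is stored as a fraction between 0 and 1; `chart_df` is an illustrative name.

```python
import matplotlib.pyplot as plt
import pandas as pd

# made-up values, for illustration only
chart_df = pd.DataFrame({
    "State": ["AL", "AK", "AZ"],
    "Share of state + local spent": [0.35, 0.42, 0.28],
    "total state plus total local federal grant": [4.0e9, 1.5e9, 7.5e9],
})

fig, ax_bar = plt.subplots(figsize=(8, 4))

# bars on the left axis: share of funding spent, shown as a percentage
ax_bar.bar(chart_df["State"], chart_df["Share of state + local spent"] * 100, color="tab:blue")
ax_bar.set_ylabel("% of funding spent")

# twinx() creates a second y-axis that shares the same x-axis
ax_line = ax_bar.twinx()
ax_line.plot(chart_df["State"], chart_df["total state plus total local federal grant"],
             color="tab:red", marker="o")
ax_line.set_ylabel("Total state + local grant ($)")

fig.tight_layout()
```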




## Data Exploration and Cleanup:
- Describe here the group's data sets and how they were cleaned for analysis

## Greg Work Area

### CDC Data

In [None]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
import scipy.stats as st
import seaborn as sns

In [None]:
#Import vaccination data from csv
vac_df = pd.read_csv('COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv')

In [None]:
#function formats the CDC dataframe for US jurisdictions - see below for input formats
def CDC_format(df,key,date,add_str,drop_values): #key, date, add_str -> str; drop_values -> list
    df = df.dropna(subset=[key])
    df = df[df['Date'] == date] #filter the dataframe passed in, not the global vac_df
    df = df[~df[key].isin(drop_values)].copy()
    df[key] = add_str + df[key].astype(str)
    df.reset_index(drop = True, inplace = True)
    return df

In [None]:
drop = ['DD2','FM','AS','VI','BP2','IH2','GU','PN','PR','VA2','PW','US','MP','MH']
vac_df = CDC_format(vac_df,'Location','12/28/2022','US_',drop)

In [None]:
vac_df

In [None]:
### Google vac data

In [None]:
#Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
import scipy.stats as st
import seaborn as sns
import cartopy.crs as ccrs
import geoviews as gv # noqa
import pyproj
import geopandas as gpd
import hvplot.pandas
import plotly.express as px

In [None]:
#Import vaccination data from google api
vac_df = pd.read_csv('https://storage.googleapis.com/covid19-open-data/v3/latest/vaccinations.csv')

In [None]:
#function formats the google dataframes - see below for input formats
def google_format(df,key,filt,length,columns,drop_values): #key, filt -> str; length -> int; columns, drop_values -> list
    df = df.dropna(subset=[key])
    df = df[df[key].str.contains(filt)]
    mask = (df[key].str.len() == length)
    df = df.loc[mask]
    df = df[columns]
    df = df[~df[key].isin(drop_values)]
    df.reset_index(drop = True, inplace = True)
    return df

In [None]:
#Input values for vaccination data
drop = ['US_AS','US_GU','US_MP','US_PR','US_VI']
cols = ['date','location_key','cumulative_persons_fully_vaccinated','new_persons_vaccinated','new_persons_fully_vaccinated']
loc_key = 'location_key'
contains = 'US_'

In [None]:
#formatting vaccination data
vac_df = google_format(vac_df, loc_key, contains, 5, cols, drop)

In [None]:
mylist = ['Orange','Apple'] #Keywords search
pattern = '|'.join(mylist)
vac_df.location_key.str.contains(pattern)

In [None]:
#reading demographic data
dem_df = pd.read_csv('demographics.csv')

In [None]:
dem_df

In [None]:
dcols = ['location_key','population']

In [None]:
#formatting demographic data
dem_df = google_format(dem_df, loc_key, contains, 5, dcols, drop)

In [None]:
#reading epidemiology data
epi_df = pd.read_csv('https://storage.googleapis.com/covid19-open-data/v3/latest/epidemiology.csv')

In [None]:
ecols = ['location_key','cumulative_confirmed','cumulative_deceased','cumulative_recovered']

In [None]:
#formatting epidemiology data
epi_df = google_format(epi_df, loc_key, contains, 5, ecols, drop)

In [None]:
#stored under a new name so the loc_key string ('location_key') defined above is not overwritten
us_df = pd.read_csv('https://storage.googleapis.com/covid19-open-data/v3/location/US.csv')

In [None]:
AK_vac_df = vac_df[vac_df['location_key'].str.contains('US_AK')]

In [None]:
#Looking at only one state - this can be skipped
AK_total = AK_vac_df['cumulative_persons_fully_vaccinated'].iloc[1:len(AK_vac_df)].sum()
AK_total

In [None]:
#we don't need this at the moment, can be skipped
def swap_rows(df, i1, i2): #Keep this!!!
    a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy()
    df.iloc[i1, :], df.iloc[i2, :] = b, a
    return df

In [None]:
#merging dataframes
total_df = vac_df.merge(dem_df, how = 'inner',on = 'location_key')

In [None]:

total_df['percent_fully_vaccinated'] = (total_df['cumulative_persons_fully_vaccinated']/total_df['population'])*100
total_df.sort_values('percent_fully_vaccinated', ascending = False)

In [None]:
#merging dataframes
total_df = total_df.merge(epi_df, how = 'inner',on = 'location_key')

In [None]:
total_df['percent_death_rate_by_case'] = (total_df['cumulative_deceased']/total_df['cumulative_confirmed'])*100

In [None]:
total_df['percent_death_rate_per_capita'] = (total_df['cumulative_deceased']/total_df['population'])*100

In [None]:
total_df['percent_confirmed'] = (total_df['cumulative_confirmed']/total_df['population'])*100

In [None]:
total_df['state_code'] = total_df.location_key.str.replace('US_','') #adding the state code for the plotly function

In [None]:
total_df.sort_values('percent_fully_vaccinated', ascending = False)

In [None]:
#function for regression plots
def reg(df,x,y,x_text,y_text):    
    lm = st.linregress(x = df[x], y = df[y])
    data_fit = lm[0]*df[x] + lm[1]
    fit_df = pd.DataFrame({'x': df[x], 'fitted': data_fit})
    ax = sns.scatterplot(data = df, x = x, y = y)
    #ax = df.plot.scatter(y = y, x = x, s = 30)
    print(f"The r-value is: {lm[2]}")
    fit_df.plot.line(x = 'x', y = 'fitted', color = 'red', ax=ax, legend = None, xlabel = x)
    plt.text(x_text,y_text,f"y = {'%.2f' %lm[0]}x + {'%.1f' %lm[1]}", color = 'red', fontsize = 16)
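
`linregress` returns a result whose fields can also be read by name (`slope`, `intercept`, `rvalue`), which is equivalent to the `lm[0]`, `lm[1]`, `lm[2]` indexing used in `reg`. A short sketch on synthetic points; the `+` format specifier is a small variation that prints the intercept with its own sign, so a negative intercept does not render as `+ -1.0`.

```python
import numpy as np
from scipy import stats

# synthetic points that lie exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

result = stats.linregress(x, y)

# named attributes are equivalent to result[0], result[1], result[2]
equation = f"y = {result.slope:.2f}x {result.intercept:+.1f}"
print(equation)
print(f"r-value: {result.rvalue:.3f}, r-squared: {result.rvalue ** 2:.3f}")
```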

In [None]:
reg(total_df,'percent_fully_vaccinated','percent_death_rate_by_case',50,0.6)

In [None]:
reg(total_df,'percent_fully_vaccinated','percent_death_rate_per_capita',50,0.15)

In [None]:
reg(total_df,'percent_fully_vaccinated','percent_confirmed',50,20)

In [None]:
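# note: gpd.datasets was deprecated in geopandas 0.14 and removed in 1.0, so this call needs an older geopandas
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))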

world.hvplot(c='country', geo=True)

In [None]:
#generating map of us states - you need to specify the color variable as one of the dataframe columns 
fig = px.choropleth(total_df,
                    locations='state_code', 
                    locationmode="USA-states", 
                    scope="usa",
                    color='percent_death_rate_per_capita',
                    color_continuous_scale="blues" 
                    )
# fig.add_scattergeo(
#     locations=total_df['state_code'],
#     locationmode="USA-states", 
#     text=total_df['state_code'],
#     mode='text',
# )
fig.show()

## Joanna Work Area

In [1]:
#putting Greg's code down here so I can run my area independently of the rest of the sheet without error
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from scipy.stats import linregress
import scipy.stats as st
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [2]:
#Import vaccination data from csv
vac_df = pd.read_csv('Resources/COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv')



In [3]:
#function formats the CDC dataframe for US jurisdictions - see below for input formats
def CDC_format(df,key,date,add_str,drop_values): #key, date, add_str -> str; drop_values -> list
    df = df.dropna(subset=[key])
    df = df[df['Date'] == date] #filter the dataframe passed in, not the global vac_df
    df = df[~df[key].isin(drop_values)].copy()
    df[key] = add_str + df[key].astype(str)
    df.reset_index(drop = True, inplace = True)
    return df

In [4]:
#drop non-state territories from dataframe, select only rows with 12/28/22 data
drop = ['DD2','FM','AS','VI','BP2','IH2','GU','PN','PR','VA2','PW','US','MP','MH']
vac_df = CDC_format(vac_df,'Location','12/28/2022','US_',drop)

In [5]:
#change location to match state code
vac_df['Location'] = vac_df['Location'].str.replace('US_', '')

## To do list
Calculate population number they are using for each state and use it to calculate the Pop_Pct for Administered_Bivalent column

Compare Administered to Recip_Administered to see if there are any significant differences in any state

Make some smaller dataframes for viewing:

a) Whole pop with Distrib, Administered, Dose1, Series Complete, Additional Doses, Second Booster, Administered Bivalent

b) Each individual age group with Dose1, Series Complete, Additional Doses, Second Booster, Bivalent Booster

c) Each category (Dose1, Series Complete, Additional Doses, Second Booster, Bivalent Booster) with all age ranges

Identify which states have a high variance from the mean (general/nationwide population) in % vaccinated (looking at all dosage categories and age categories). This will show us which states were the "good vaccinators" and which the "poor vaccinators." We can then use the EARN data to see if this correlates to how much of the federal money they spent, how many vaccination projects they did, etc.
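
Two of the items above can be sketched together on made-up numbers: backing out the population implied by an existing count and percentage pair, and flagging states far from the mean. The column names `Implied_Pop`, `Administered_Bivalent_Pop_Pct`, and `z_score` are hypothetical, and the threshold is arbitrary. Two caveats: the CDC percentages are rounded, so the implied population is only approximate, and the mean used here is the unweighted mean across states, not the nationwide rate.

```python
import pandas as pd

# made-up numbers, for illustration only
df = pd.DataFrame({
    "Location": ["AA", "BB", "CC", "DD", "EE"],
    "Series_Complete_Yes": [600_000, 1_500_000, 700_000, 320_000, 900_000],
    "Series_Complete_Pop_Pct": [60.0, 75.0, 70.0, 64.0, 45.0],
    "Administered_Bivalent": [150_000, 320_000, 140_000, 150_000, 100_000],
})

# implied population = count / (percent / 100)
df["Implied_Pop"] = df["Series_Complete_Yes"] / (df["Series_Complete_Pop_Pct"] / 100)

# bivalent doses per 100 residents, using the implied population
df["Administered_Bivalent_Pop_Pct"] = (
    df["Administered_Bivalent"] / df["Implied_Pop"] * 100
).round(1)

# z-score: how many standard deviations each state sits from the mean of the states
col = "Administered_Bivalent_Pop_Pct"
df["z_score"] = (df[col] - df[col].mean()) / df[col].std()

threshold = 1.0  # arbitrary cut-off chosen for this sketch
high = df.loc[df["z_score"] > threshold, "Location"].tolist()
low = df.loc[df["z_score"] < -threshold, "Location"].tolist()
print(high, low)
```

The same z-score step could be repeated for each dose category and age group before comparing the flagged states against the EARN spending data.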


In [6]:
# get all the columns we will be interested in into one dataframe
# NOTE: there is no Pop_Pct column for the administered_bivalent, and second_booster only for the age breakouts
# but we can extrapolate from their other population calculations to calculate these. For second_booster to get state numbers
# we have to add up the vaccines from the different manufacturers because we don't have them already summed.

vac_df = pd.DataFrame(data=vac_df, columns=["Location", "Distributed", "Administered", "Recip_Administered",
                                                   "Administered_Dose1_Recip", "Administered_Dose1_Pop_Pct",
                                                   "Administered_Dose1_Recip_5Plus", "Administered_Dose1_Recip_5PlusPop_Pct",
                                                  "Administered_Dose1_Recip_12Plus", "Administered_Dose1_Recip_12PlusPop_Pct",
                                                  "Administered_Dose1_Recip_18Plus", "Administered_Dose1_Recip_18PlusPop_Pct",
                                                  "Administered_Dose1_Recip_65Plus", "Administered_Dose1_Recip_65PlusPop_Pct",
                                                  "Series_Complete_Yes", "Series_Complete_Pop_Pct", "Series_Complete_5Plus",
                                                  "Series_Complete_12Plus", "Series_Complete_12PlusPop_Pct",
                                                   "Series_Complete_18Plus", "Series_Complete_18PlusPop_Pct",
                                                   "Series_Complete_65Plus", "Series_Complete_65PlusPop_Pct", "Additional_Doses",
                                                   "Additional_Doses_Vax_Pct", "Additional_Doses_5Plus",
                                                   "Additional_Doses_5Plus_Vax_Pct", "Additional_Doses_12Plus",
                                                   "Additional_Doses_12Plus_Vax_Pct", "Additional_Doses_18Plus",
                                                   "Additional_Doses_18Plus_Vax_Pct", "Additional_Doses_50Plus",
                                                   "Additional_Doses_50Plus_Vax_Pct", "Additional_Doses_65Plus",
                                                   "Additional_Doses_65Plus_Vax_Pct", "Second_Booster_50Plus",
                                                   "Second_Booster_50Plus_Vax_Pct", "Second_Booster_65Plus",
                                                   "Second_Booster_65Plus_Vax_Pct", "Second_Booster_Janssen",
                                                    "Second_Booster_Moderna", "Second_Booster_Pfizer",
                                                    "Second_Booster_Unk_Manuf", "Administered_Bivalent",
                                                   "Bivalent_Booster_5Plus", "Bivalent_Booster_5Plus_Pop_Pct",
                                                   "Bivalent_Booster_12Plus", "Bivalent_Booster_12Plus_Pop_Pct",
                                                   "Bivalent_Booster_18Plus", "Bivalent_Booster_18Plus_Pop_Pct"])

In [33]:
# remove commas from numeric columns
# convert numeric columns to correct type
vac_df = vac_df.replace(',','', regex=True)
numeric_cols = ["Distributed", "Administered", "Recip_Administered", "Administered_Dose1_Recip", "Administered_Dose1_Pop_Pct",
                "Administered_Dose1_Recip_5Plus", "Administered_Dose1_Recip_5PlusPop_Pct", "Administered_Dose1_Recip_12Plus",
                "Administered_Dose1_Recip_12PlusPop_Pct", "Administered_Dose1_Recip_18Plus",
                "Administered_Dose1_Recip_18PlusPop_Pct", "Administered_Dose1_Recip_65Plus",
                "Administered_Dose1_Recip_65PlusPop_Pct", "Series_Complete_Yes", "Series_Complete_Pop_Pct",
                "Series_Complete_5Plus", "Series_Complete_12Plus", "Series_Complete_12PlusPop_Pct", "Series_Complete_18Plus",
                "Series_Complete_18PlusPop_Pct", "Series_Complete_65Plus", "Series_Complete_65PlusPop_Pct", "Additional_Doses",
                "Additional_Doses_Vax_Pct", "Additional_Doses_5Plus", "Additional_Doses_5Plus_Vax_Pct", "Additional_Doses_12Plus",
                "Additional_Doses_12Plus_Vax_Pct", "Additional_Doses_18Plus", "Additional_Doses_18Plus_Vax_Pct",
                "Additional_Doses_50Plus", "Additional_Doses_50Plus_Vax_Pct", "Additional_Doses_65Plus",
                "Additional_Doses_65Plus_Vax_Pct", "Second_Booster_50Plus", "Second_Booster_50Plus_Vax_Pct",
                "Second_Booster_65Plus", "Second_Booster_65Plus_Vax_Pct", "Second_Booster_Janssen",
                "Second_Booster_Moderna", "Second_Booster_Pfizer", "Second_Booster_Unk_Manuf", "Administered_Bivalent",
                "Bivalent_Booster_5Plus", "Bivalent_Booster_5Plus_Pop_Pct", "Bivalent_Booster_12Plus",
                "Bivalent_Booster_12Plus_Pop_Pct", "Bivalent_Booster_18Plus", "Bivalent_Booster_18Plus_Pop_Pct"]
vac_df[numeric_cols] = vac_df[numeric_cols].apply(pd.to_numeric)
vac_df

Unnamed: 0,Location,Distributed,Administered,Recip_Administered,Administered_Dose1_Recip,Administered_Dose1_Pop_Pct,Administered_Dose1_Recip_5Plus,Administered_Dose1_Recip_5PlusPop_Pct,Administered_Dose1_Recip_12Plus,Administered_Dose1_Recip_12PlusPop_Pct,...,Administered_Bivalent,Bivalent_Booster_5Plus,Bivalent_Booster_5Plus_Pop_Pct,Bivalent_Booster_12Plus,Bivalent_Booster_12Plus_Pop_Pct,Bivalent_Booster_18Plus,Bivalent_Booster_18Plus_Pop_Pct,Second_Booster_Total,Pop1,Pop2
0,CT,11421135,8883525,8933212,3636096,95.0,3609191,95.0,3452994,95.0,...,782715,780655,23.1,767615,24.7,742343,26.2,657393,3563964.0,3827469.0
1,NJ,28223715,19503839,20082553,8372496,94.3,8322790,95.0,7964580,95.0,...,1241078,1261744,15.1,1239021,16.3,1199351,17.3,1178279,8885250.0,8878575.0
2,OK,9308930,6660547,6642791,2940863,74.3,2931352,79.2,2846375,85.6,...,399410,397540,10.7,393192,11.8,384048,12.8,336701,3958556.0,3958093.0
3,NE,5229080,3735602,3754048,1414102,73.1,1402138,77.7,1336234,82.6,...,284855,282371,15.7,277007,17.1,267140,18.3,296496,1934447.0,1934476.0
4,DE,3169595,2120412,2094422,853776,87.7,848927,92.4,816603,95.0,...,171972,166740,18.1,164912,19.6,160561,20.8,147425,973843.6,973518.8
5,ME,4718980,3458390,3463706,1302731,95.0,1291558,95.0,1241475,95.0,...,365193,357866,27.9,351413,29.7,341016,31.1,330333,1344463.0,1371296.0
6,HI,4448760,3451416,3486162,1289826,91.1,1279013,95.0,1215234,95.0,...,264138,266673,20.0,261423,21.6,253132,22.7,286603,1415288.0,1415835.0
7,AZ,19039550,14302488,13939670,5610927,77.1,5575844,81.4,5314649,85.6,...,957105,938800,13.7,921888,14.9,890478,15.8,938216,7277000.0,7277467.0
8,KY,11589155,7392531,7478240,3064611,68.6,3049985,72.7,2949714,77.5,...,485457,483242,11.5,477521,12.5,466038,13.5,476673,4467902.0,4467363.0
9,VT,2444660,1718359,1689315,617699,95.0,609299,95.0,577799,95.0,...,184091,182966,30.8,177947,32.3,170461,33.4,168552,623875.7,650209.5


In [30]:
# calculate totals for second booster
vac_df["Second_Booster_Total"] = (vac_df["Second_Booster_Janssen"] + vac_df["Second_Booster_Moderna"]
                                + vac_df["Second_Booster_Pfizer"] + vac_df["Second_Booster_Unk_Manuf"])
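# back-calculate each state's population from the counts and percentages
# note: Pop1 and Pop2 disagree where a reported percentage sits at 95.0 (e.g. CT, ME, VT), so treat these as rough estimates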
vac_df["Pop1"] = vac_df["Series_Complete_Yes"] / (vac_df["Series_Complete_Pop_Pct"]/100)
vac_df["Pop2"] = vac_df["Administered_Dose1_Recip"] / (vac_df["Administered_Dose1_Pop_Pct"]/100)

vac_pops_df = pd.DataFrame(data=vac_df, columns=["Location", "Pop1", "Pop2"])
vac_pops_df


Unnamed: 0,Location,Pop1,Pop2
0,CT,3563964.0,3827469.0
1,NJ,8885250.0,8878575.0
2,OK,3958556.0,3958093.0
3,NE,1934447.0,1934476.0
4,DE,973843.6,973518.8
5,ME,1344463.0,1371296.0
6,HI,1415288.0,1415835.0
7,AZ,7277000.0,7277467.0
8,KY,4467902.0,4467363.0
9,VT,623875.7,650209.5
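
The two estimates above diverge most for CT, ME, and VT, all of which report 95.0 for the first-dose percentage. That is consistent with the CDC capping reported coverage at 95%, in which case dividing a count by the capped percentage understates or overstates the real population. A minimal sketch of one way to handle this, assuming a cap of 95.0 and using a hypothetical helper named `estimate_population` (not part of the notebook), is to skip the back-calculation for capped rows:

```python
import pandas as pd

# Three rows copied from the vac_df output above.
sample = pd.DataFrame({
    "Location": ["CT", "OK", "VT"],
    "Administered_Dose1_Recip": [3636096, 2940863, 617699],
    "Administered_Dose1_Pop_Pct": [95.0, 74.3, 95.0],
})

def estimate_population(counts, pcts, cap=95.0):
    """Back-calculate population; return NaN where pct is at or above the assumed cap."""
    usable_pcts = pcts.where(pcts < cap)  # capped percentages become NaN
    return counts / (usable_pcts / 100)

sample["Pop2_uncapped"] = estimate_population(
    sample["Administered_Dose1_Recip"], sample["Administered_Dose1_Pop_Pct"]
)
print(sample)
```

Rows left as NaN would need a population figure from another source, such as a Census estimate.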


In [22]:
# df with vax data for all ages
vac_all_ages_df = pd.DataFrame(data=vac_df, columns=["Location", "Distributed", "Administered", "Recip_Administered",
                                                   "Administered_Dose1_Recip", "Administered_Dose1_Pop_Pct",
                                                    "Series_Complete_Yes", "Series_Complete_Pop_Pct",
                                                     "Additional_Doses", "Additional_Doses_Vax_Pct", "Administered_Bivalent"])

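# add 'Dose Differential' column: doses administered in the state minus doses received by the state's residents
# Negative number = residents received more doses than the state administered (some residents were vaccinated elsewhere)
# The two 'Pct' columns below are stored as fractions (0.01 = 1%)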
vac_all_ages_df["Dose Differential"] = vac_all_ages_df["Administered"] - vac_all_ages_df["Recip_Administered"]
vac_all_ages_df["Dose Diff. as Pct of Doses Given"] = abs(vac_all_ages_df["Dose Differential"] / vac_all_ages_df["Administered"])
vac_all_ages_df["Dose Diff. as Pct of Residents Vaxxed"] = abs(vac_all_ages_df["Dose Differential"] / vac_all_ages_df["Recip_Administered"])


In [25]:
vac_dd_df = pd.DataFrame(data=vac_all_ages_df, columns=["Location", "Distributed", "Administered", "Recip_Administered", "Dose Differential",
                         "Dose Diff. as Pct of Doses Given", "Dose Diff. as Pct of Residents Vaxxed", "Administered_Dose1_Pop_Pct", "Series_Complete_Pop_Pct",
                         "Additional_Doses_Vax_Pct"])
vac_dd_df




Unnamed: 0,Location,Distributed,Administered,Recip_Administered,Dose Differential,Dose Diff. as Pct of Doses Given,Dose Diff. as Pct of Residents Vaxxed,Administered_Dose1_Pop_Pct,Series_Complete_Pop_Pct,Additional_Doses_Vax_Pct
0,CT,11421135,8883525,8933212,-49687,0.005593,0.005562,95.0,82.8,55.2
1,NJ,28223715,19503839,20082553,-578714,0.029672,0.028817,94.3,78.8,51.6
2,OK,9308930,6660547,6642791,17756,0.002666,0.002673,74.3,60.2,41.4
3,NE,5229080,3735602,3754048,-18446,0.004938,0.004914,73.1,66.0,55.1
4,DE,3169595,2120412,2094422,25990,0.012257,0.012409,87.7,72.9,50.9
5,ME,4718980,3458390,3463706,-5316,0.001537,0.001535,95.0,83.0,60.6
6,HI,4448760,3451416,3486162,-34746,0.010067,0.009967,91.1,81.3,59.2
7,AZ,19039550,14302488,13939670,362818,0.025367,0.026028,77.1,65.8,49.5
8,KY,11589155,7392531,7478240,-85709,0.011594,0.011461,68.6,59.4,48.2
9,VT,2444660,1718359,1689315,29044,0.016902,0.017193,95.0,85.3,66.1


In [9]:
# Checking on second booster columns -- these NaN values actually exist in the spreadsheet. Is there something going on with
# the function that was used to create the initial dataframe?
vac_secondbooster_df = pd.DataFrame(data=vac_df, columns=["Location", "Second_Booster_50Plus",
                                                   "Second_Booster_50Plus_Vax_Pct", "Second_Booster_65Plus",
                                                   "Second_Booster_65Plus_Vax_Pct", "Second_Booster_Janssen",
                                                    "Second_Booster_Moderna", "Second_Booster_Pfizer",
                                                    "Second_Booster_Unk_Manuf"])


In [11]:
# find how many doses were distributed vs administered
# calculate percent
# sort alphabetically by state
vac_waste_df = pd.DataFrame(data=vac_df, columns=["Location", "Distributed", "Administered"])
vac_waste_df["Pct. Administered"] = vac_waste_df["Administered"] / vac_waste_df["Distributed"]
vac_waste_df.sort_values('Location')

Unnamed: 0,Location,Distributed,Administered,Pct. Administered
39,AK,2066295,1303955,0.631059
41,AL,11897230,6929554,0.582451
12,AR,8034900,4803994,0.597891
7,AZ,19039550,14302488,0.751199
47,CA,115820215,86604013,0.747745
45,CO,16867685,12765099,0.756778
0,CT,11421135,8883525,0.777815
28,DC,2617105,1930871,0.737789
4,DE,3169595,2120412,0.668985
17,FL,58979615,41494703,0.703543


In [15]:
# Show the 10 states with the highest share of distributed doses administered
vac_waste_best_df = vac_waste_df.sort_values('Pct. Administered', ascending=False)
vac_waste_best_df.head(10)

Unnamed: 0,Location,Distributed,Administered,Pct. Administered
38,MA,22540490,17765199,0.788146
13,NM,5914665,4649331,0.786068
15,WI,15646525,12187825,0.778948
0,CT,11421135,8883525,0.777815
6,HI,4448760,3451416,0.775815
22,RI,3381695,2610412,0.771924
27,NY,58677865,44418984,0.756997
45,CO,16867685,12765099,0.756778
25,VA,25592245,19227095,0.751286
7,AZ,19039550,14302488,0.751199


In [16]:
# Show the 10 states with the lowest share of distributed doses administered
vac_waste_worst_df = vac_waste_df.sort_values('Pct. Administered', ascending=True)
vac_waste_worst_df.head(10)

Unnamed: 0,Location,Distributed,Administered,Pct. Administered
23,WV,5268855,3063510,0.581438
41,AL,11897230,6929554,0.582451
12,AR,8034900,4803994,0.597891
50,NH,4844640,2906126,0.599864
19,MS,6967035,4248212,0.609759
33,GA,27477235,16829103,0.612474
31,IN,17931670,11008571,0.613918
46,ID,4537400,2837380,0.625332
35,SC,13521655,8522851,0.630311
39,AK,2066295,1303955,0.631059


# Kendal Work Area

In [None]:
#putting Greg's code down here so I can run my area independently of the rest of the sheet without error
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from scipy.stats import linregress
import scipy.stats as st
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [None]:
#Import vaccination data from csv
vac_df = pd.read_csv('COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv')

In [None]:
#function formats the CDC dataframe for US jurisdictions
#inputs: df -> DataFrame; key, date, add_str -> str; drop_values -> list
def CDC_format(df,key,date,add_str,drop_values):
    df = df.dropna(subset=[key])
    df = df[df['Date'] == date] #filter on the frame passed in, not the global vac_df
    df = df[~df[key].isin(drop_values)].copy() #copy so the assignment below does not write to a view
    df[key] = add_str + df[key].astype(str)
    df.reset_index(drop = True, inplace = True)
    return df
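
To check what the formatting step does without loading the full CDC file, the same three filters can be run on a tiny frame. This is a sketch under stated assumptions: `format_jurisdictions` is a hypothetical stand-alone version of the function above, and the toy rows are made up for illustration.

```python
import pandas as pd

def format_jurisdictions(df, key, date, add_str, drop_values):
    """Keep rows for one date, drop unwanted jurisdictions, and prefix the key column."""
    out = df.dropna(subset=[key])                      # remove rows with no jurisdiction code
    out = out[out["Date"] == date]                     # keep a single reporting date
    out = out[~out[key].isin(drop_values)].copy()      # drop territories and aggregates
    out[key] = add_str + out[key].astype(str)          # add the prefix to each code
    return out.reset_index(drop=True)

# Made-up rows: one valid state, one territory, one older date, one missing code.
toy = pd.DataFrame({
    "Date": ["12/28/2022", "12/28/2022", "12/21/2022", "12/28/2022"],
    "Location": ["CT", "PR", "CT", None],
})

result = format_jurisdictions(toy, "Location", "12/28/2022", "US_", ["PR"])
print(result)
```

Only the Connecticut row for the requested date survives, and the input frame is left unchanged because every step works on a filtered copy.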

In [None]:
#drop non-state territories from dataframe
drop = ['DD2','FM','AS','VI','BP2','IH2','GU','PN','PR','VA2','PW','US','MP','MH']
vac_df = CDC_format(vac_df,'Location','12/28/2022','US_',drop)

In [None]:
#change location to match state code for choropleth maps
vac_df['Location'] = vac_df['Location'].str.replace('US_', '')
vac_df = vac_df.rename(columns={'Series_Complete_Pop_Pct':'Percentage of Population Fully Vaccinated',
                            'Administered_Dose1_Pop_Pct':'% of Population with at least 1 dose',
                            'Series_Complete_65PlusPop_Pct': '% of Population Fully Vaccinated (65+)',
                            'Bivalent_Booster_65Plus_Pop_Pct': '% of Population with bivalent booster (65+)'}) 

In [None]:
#Go.choropleth method (https://plotly.com/python/choropleth-maps/)
fig = go.Figure(data=go.Choropleth(
    locations=vac_df['Location'], # Spatial coordinates
    z = vac_df['Percentage of Population Fully Vaccinated'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Viridis',
    colorbar_title = "Percentage of Population Fully Vaccinated",
))

fig.update_layout(
    title_text = 'Vaccination Rates by State',
    geo_scope='usa', # limit map scope to USA
)

fig.show()

In [None]:
fig_complete_total_pop = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Percentage of Population Fully Vaccinated',
                    color_continuous_scale="aggrnyl",
                    )

In [None]:
fig_at_least_1 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='% of Population with at least 1 dose',
                    color_continuous_scale="twilight",
                    )

In [None]:
fig_complete_65_plus = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='% of Population Fully Vaccinated (65+)',
                    color_continuous_scale="algae",                 
                    )

In [None]:
fig_bivalent_booster_65 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='% of Population with bivalent booster (65+)',
                    color_continuous_scale="icefire",                  
                    )

In [None]:
fig_complete_total_pop

In [None]:
fig_at_least_1

In [None]:
fig_complete_65_plus

In [None]:
fig_bivalent_booster_65

In [None]:
#Converting percent fully vaccinated into a NumPy array
percent_fully_vaccinated = vac_df['Percentage of Population Fully Vaccinated'].to_numpy()
print(percent_fully_vaccinated)

In [None]:
#boxplots showing spread of data across all 50 states and DC for selected columns
boxplot = vac_df.boxplot(column=['Percentage of Population Fully Vaccinated', 
                                 '% of Population with at least 1 dose', 
                                 '% of Population Fully Vaccinated (65+)', 
                                 '% of Population with bivalent booster (65+)'], 
                         rot=45,
                         grid=True,
                         figsize = (15,10),
                        )
plt.title("Distribution of Vaccination Rates Across U.S. States")

In [None]:
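#summary statistics for selected columns (across all 50 states and DC)
#use the renamed column labels, since the original CDC names were replaced in the rename cell above
vac_df_short = vac_df[['Percentage of Population Fully Vaccinated', '% of Population with at least 1 dose',
                       '% of Population Fully Vaccinated (65+)', '% of Population with bivalent booster (65+)']]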

vac_df_short.describe()
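
The boxplot and `describe()` output show quartiles but do not name which states fall outside the whiskers. A minimal sketch of the usual 1.5 × IQR rule (the default whisker length in matplotlib boxplots) follows; it uses the ten `Series_Complete_Pop_Pct` values shown earlier in the notebook rather than the full 51-row frame, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Ten values copied from the vac_dd_df output earlier in the notebook.
fully_vaccinated = pd.Series(
    [82.8, 78.8, 60.2, 66.0, 72.9, 83.0, 81.3, 65.8, 59.4, 85.3],
    index=["CT", "NJ", "OK", "NE", "DE", "ME", "HI", "AZ", "KY", "VT"],
    name="Series_Complete_Pop_Pct",
)

# Quartiles and interquartile range.
q1 = fully_vaccinated.quantile(0.25)
q3 = fully_vaccinated.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# States whose rate falls outside the fences.
outliers = fully_vaccinated[(fully_vaccinated < lower_fence) | (fully_vaccinated > upper_fence)]
print(f"Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(outliers)
```

For this ten-state sample no value falls outside the fences; the same steps can be applied to each renamed percentage column in `vac_df`.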

# Sarah Work Area

# Aaliyah Work Area