# A Comparison of State Use of SLFRF Funds for Vaccination Programs and Vaccination Rates in Each State



### Data Sources:
CDC - "COVID-19 Vaccinations in the United States, Jurisdiction"
csv downloaded 5/11/23
https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-Jurisdi/unsk-b7fc

EARN/EPI - "EARN SLFRF Workbook for Q4 2022" compiled by Dave Kamper of the Economic Policy Institute (dkamper@epi.org) from Treasury reports by states and local jurisdictions that received funding, and other data sources as detailed in the workbook.

## Production Code (Team: Put your code here after it is complete and ready to go)

## Evan Work Area

In [32]:
# import dependencies and setup
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pprint import pprint
from pathlib import Path

In [33]:
# Load csv file(s)
all_states_sheet = Path("Resources/EARN_all_states.csv")


# Read csv file(s) as a DataFrame
all_states_df = pd.read_csv(all_states_sheet, skipinitialspace= True)


# preview the raw DataFrame
print(len(all_states_df['Project ID']))
all_states_df.head()


all_states_df.columns = all_states_df.columns.str.strip()

print(all_states_df.columns)

35710
Index(['Project ID', 'Recipient-ID', 'Recipient Name', 'State/Territory',
       'StateList', 'Reporting Tier', 'Recipient Type', 'Completion Status',
       'Project Name', 'Expenditure Category Group', 'Expenditure Category',
       'Project Description', 'Adopted Budget', 'Total Cumulative Obligations',
       'Total Cumulative Expenditures',
       'Community benefit agreement? (Infrastructure Only)',
       'Complying with David Bacon? (Infrastructure Only)',
       'Project labor agreement? (Infrastructure Only)',
       'Primary Demographic Served (Select Expenditure Categories Only)'],
      dtype='object')



Columns (15,16,17) have mixed types. Specify dtype option on import or set low_memory=False.
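One way to address the mixed-type warning above is to read the flagged columns as strings on import and convert them later. The sketch below is only an illustration: it uses a tiny inline CSV (hypothetical rows) in place of `Resources/EARN_all_states.csv`, and it assumes the flagged positions 15-17 correspond to the three "(Infrastructure Only)" columns in the printed Index, which has not been checked against the file.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for Resources/EARN_all_states.csv (two columns only).
sample_csv = io.StringIO(
    "Project ID,Project labor agreement? (Infrastructure Only)\n"
    "TPN-1,Yes\n"
    "TPN-2,\n"
    "TPN-3,0\n"
)

# Reading the flagged column as str stops pandas from inferring a different type per chunk.
flag_col = "Project labor agreement? (Infrastructure Only)"
sample_df = pd.read_csv(sample_csv, dtype={flag_col: str}, skipinitialspace=True)

# Text and digits are both kept as strings; the empty cell is still read as NaN.
print(sample_df[flag_col].tolist())
```

The same `dtype=` mapping could be passed to the real `pd.read_csv` call once the flagged column names are confirmed.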



In [34]:
# Review list of NA values in the 'Project Description' column
nan_values = all_states_df[all_states_df['Project Description'].isna()]

# print(len(nan_values))
print(f'There are {len(nan_values)} rows with NA values in "Project Description" column:')

#nan_values

There are 4 rows with NA values in "Project Description" column:


In [35]:
# Drop these rows where the column has NaN value
    # source: https://towardsdatascience.com/how-to-drop-rows-in-pandas-dataframes-with-nan-values-in-certain-columns-7613ad1a7f25
    
all_states_df = all_states_df.dropna(subset=['Project Description'], how='all')

# confirm 4 rows were dropped by reviewing column length count:

print(f'The DataFrame now has {len(all_states_df["Project ID"])} rows of data.')
all_states_df.head(1)


The DataFrame now has 35706 rows of data.


Unnamed: 0,Project ID,Recipient-ID,Recipient Name,State/Territory,StateList,Reporting Tier,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Community benefit agreement? (Infrastructure Only),Complying with David Bacon? (Infrastructure Only),Project labor agreement? (Infrastructure Only),Primary Demographic Served (Select Expenditure Categories Only)
0,TPN-039343,RCP-039196,"Woodbury County, Iowa",Iowa,Iowa,Tier 2. Metropolitan cities and counties with...,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,HVAC to mitigate covid,-,-,-,,,,1 Imp General Public


In [36]:
# Make the Project Description values all lowercase for value search:
all_states_df['Project Description'] = all_states_df['Project Description'].str.lower()

print(f'The Project Description column has been set to lowercase for all string values:')
all_states_df.head(2)

The Project Description column has been set to lowercase for all string values:


Unnamed: 0,Project ID,Recipient-ID,Recipient Name,State/Territory,StateList,Reporting Tier,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Community benefit agreement? (Infrastructure Only),Complying with David Bacon? (Infrastructure Only),Project labor agreement? (Infrastructure Only),Primary Demographic Served (Select Expenditure Categories Only)
0,TPN-039343,RCP-039196,"Woodbury County, Iowa",Iowa,Iowa,Tier 2. Metropolitan cities and counties with...,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,-,-,-,,,,1 Imp General Public
1,TPN-039461,RCP-036070,"Lexington-Fayette Urban County, Kentucky",Kentucky,Kentucky,"Tier 1. States, U.S. territories, metropolitan...",Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,-,-,-,,,,


In [37]:
# Brainstorm a list of words to filter the 'Project Description' column by.
    ## this list will be used to filter that column so that we are only working with projects that
    ## are actually covid related.
    
# TODO: confirm string case does not affect search results. eg) lowercase moderna vs Moderna.
search_term_list = ['covid', 'covid-19', 'vaccine', 'vaccination', 'vaccinated', 'moderna', 'pfizer', 'johnson & johnson', 'janssen']
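On the TODO above: `str.contains` is case-sensitive by default, so lowercase search terms only match because the column was lowercased in the previous cell. A small sketch of the difference, using made-up descriptions (not rows from the EARN data) and `re.escape` as an optional safeguard for terms that might contain regex characters:

```python
import re
import pandas as pd

# Hypothetical descriptions standing in for the 'Project Description' column.
descriptions = pd.Series([
    "Moderna booster clinic",
    "hvac to mitigate COVID",
    "road resurfacing",
])
terms = ["covid", "moderna", "johnson & johnson"]

# re.escape guards against terms that contain regex metacharacters.
pattern = "|".join(re.escape(term) for term in terms)

# Default matching is case-sensitive, so mixed-case text is missed by lowercase terms.
case_sensitive = descriptions.str.contains(pattern)
# case=False matches regardless of case, without lowercasing the column first.
case_insensitive = descriptions.str.contains(pattern, case=False)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```

Either approach (lowercasing the column, or passing `case=False`) should give the same matches; the notebook currently relies on the first.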



In [38]:
# Filter the dataframe column 'Project Description'
    ## source: https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p

    
covid_projects_df = all_states_df[all_states_df['Project Description'].str.contains('|'.join(search_term_list))]


# print(len(all_states_df['Project Description']))
print(f'The number of rows containing covid/vaccine search criteria terms is {len(covid_projects_df["Project ID"])}')
covid_projects_df.head(4)


The number of rows containing covid/vaccine search criteria terms is 8225


Unnamed: 0,Project ID,Recipient-ID,Recipient Name,State/Territory,StateList,Reporting Tier,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Community benefit agreement? (Infrastructure Only),Complying with David Bacon? (Infrastructure Only),Project labor agreement? (Infrastructure Only),Primary Demographic Served (Select Expenditure Categories Only)
0,TPN-039343,RCP-039196,"Woodbury County, Iowa",Iowa,Iowa,Tier 2. Metropolitan cities and counties with...,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,-,-,-,,,,1 Imp General Public
1,TPN-039461,RCP-036070,"Lexington-Fayette Urban County, Kentucky",Kentucky,Kentucky,"Tier 1. States, U.S. territories, metropolitan...",Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,-,-,-,,,,
5,TPN-055785,RCP-035970,State Of Idaho,Idaho,Idaho,"Tier 1. States, U.S. territories, metropolitan...",State/DC,Cancelled,Reserve for Covid 19 costs,1-Public Health,1.14-Other Public Health Services,additional unanticipated covid medical costs,-,-,-,,,,1 Imp General Public
10,TPN-056253,RCP-035970,State Of Idaho,Idaho,Idaho,"Tier 1. States, U.S. territories, metropolitan...",State/DC,Cancelled,DHW Home visiting,2-Negative Economic Impacts,2.12-Healthy Childhood Environments: Home Visi...,•\tthe idaho department of health and welfare ...,-,-,-,,,,14 Dis Imp Low income HHs and populations


In [39]:
# Next, convert the budget-related columns to numeric so they can be summed in the .groupby step.
# Note that pandas imported these csv columns as object dtype (strings), not as numbers:

# print(all_states_df.dtypes)
print(f'\n----------------------------\n')
print(covid_projects_df.dtypes)


----------------------------

Project ID                                                         object
Recipient-ID                                                       object
Recipient Name                                                     object
State/Territory                                                    object
StateList                                                          object
Reporting Tier                                                     object
Recipient Type                                                     object
Completion Status                                                  object
Project Name                                                       object
Expenditure Category Group                                         object
Expenditure Category                                               object
Project Description                                                object
Adopted Budget                                                     object
Total C

In [40]:
# clean up values preventing change of data type to numeric
numeric_cols = ['Adopted Budget', 'Total Cumulative Obligations', 'Total Cumulative Expenditures']

# work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
covid_projects_df = covid_projects_df.copy()

# remove '-' and space placeholders from the budget columns
# NOTE: because regex=True, the '-' pattern also strips the minus sign from any negative amounts
covid_projects_df[numeric_cols] = covid_projects_df[numeric_cols].replace(['-', ' '], '', regex=True)


# strip commas, then convert budget columns to numeric (float64, since blanks become NaN)
# for summarizing in groupby.
# NOTE: this replace runs on the whole DataFrame, so commas are also removed from text
# columns such as 'Recipient Name' (see the preview below).
covid_projects_df = covid_projects_df.replace(',','', regex=True)
covid_projects_df[numeric_cols] = covid_projects_df[numeric_cols].apply(pd.to_numeric)


print(covid_projects_df['Adopted Budget'].unique())

# print(covid_projects_df.dtypes)

covid_projects_df.head(3)


[       nan 1000000.     28300.   ...  205796.55 1705540.      8265.39]



Unnamed: 0,Project ID,Recipient-ID,Recipient Name,State/Territory,StateList,Reporting Tier,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures,Community benefit agreement? (Infrastructure Only),Complying with David Bacon? (Infrastructure Only),Project labor agreement? (Infrastructure Only),Primary Demographic Served (Select Expenditure Categories Only)
0,TPN-039343,RCP-039196,Woodbury County Iowa,Iowa,Iowa,Tier 2. Metropolitan cities and counties with...,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,,,,,,,1 Imp General Public
1,TPN-039461,RCP-036070,Lexington-Fayette Urban County Kentucky,Kentucky,Kentucky,Tier 1. States U.S. territories metropolitan c...,Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,,,,,,,
5,TPN-055785,RCP-035970,State Of Idaho,Idaho,Idaho,Tier 1. States U.S. territories metropolitan c...,State/DC,Cancelled,Reserve for Covid 19 costs,1-Public Health,1.14-Other Public Health Services,additional unanticipated covid medical costs,,,,,,,1 Imp General Public


In [41]:
# Group the filtered dataframe by state, summing the applicable $ value columns.
    ## The budget columns must already be numeric (see the cleanup cell above);
    ## eg) raw 'Adopted Budget' values such as "-" would prevent the .sum() function from working.

# Select the columns to sum explicitly; .sum() does not take a list of column names.
state_spending_df = covid_projects_df.groupby('State/Territory', as_index=False)[numeric_cols].sum()

print(f'The column headers for the state_spending_df are:\n\n {state_spending_df.columns}')
state_spending_df.head()

The column headers for the state_spending_df are:

 Index(['State/Territory', 'Adopted Budget', 'Total Cumulative Obligations',
       'Total Cumulative Expenditures'],
      dtype='object')


Unnamed: 0,State/Territory,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
0,Alabama,341856900.0,239994500.0,192230300.0
1,Alaska,89922310.0,50857970.0,47895680.0
2,American Samoa,447866300.0,32768640.0,30226340.0
3,Arizona,1530397000.0,991496100.0,693236200.0
4,Arkansas,161146800.0,159726000.0,144292500.0


In [42]:
# Add column of state name abbreviations:
# source: https://gist.github.com/rogerallen/1583593

us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
}
    
# add abbreviated state name column and reorder so the abbrev is after full state name column:
state_spending_df['Location'] = state_spending_df['State/Territory'].map(us_state_to_abbrev)
state_spending_df = state_spending_df[['State/Territory', 'Location', 'Adopted Budget', 
                                       'Total Cumulative Obligations', 'Total Cumulative Expenditures']]

state_spending_df.head()


Unnamed: 0,State/Territory,Location,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
0,Alabama,AL,341856900.0,239994500.0,192230300.0
1,Alaska,AK,89922310.0,50857970.0,47895680.0
2,American Samoa,AS,447866300.0,32768640.0,30226340.0
3,Arizona,AZ,1530397000.0,991496100.0,693236200.0
4,Arkansas,AR,161146800.0,159726000.0,144292500.0


In [43]:
# "all_us_projects_df" is for (2) from Joanna's slack message request:
all_us_projects_df = all_states_df[['Recipient Name', 'State/Territory', 'Recipient Type', 
                                    'Completion Status', 'Project Name', 'Expenditure Category Group', 'Expenditure Category', 
                                    'Project Description', 'Adopted Budget', 'Total Cumulative Obligations', 
                                    'Total Cumulative Expenditures']].copy()


all_us_projects_df['State/Territory'] = all_us_projects_df['State/Territory'].map(us_state_to_abbrev)
all_us_projects_df.rename(columns = {'State/Territory':'State'}, inplace = True)

numeric_cols = ['Adopted Budget', 'Total Cumulative Obligations', 'Total Cumulative Expenditures']

# remove '-' and space placeholders so the budget columns can be converted
all_us_projects_df[numeric_cols] = all_us_projects_df[numeric_cols].replace(['-', ' '], '', regex=True)


# strip commas (from every column, including text ones), then convert budget columns to numeric:
all_us_projects_df = all_us_projects_df.replace(',','', regex=True)
all_us_projects_df[numeric_cols] = all_us_projects_df[numeric_cols].apply(pd.to_numeric)

# all_us_projects_df.dtypes
all_us_projects_df.head(3)

Unnamed: 0,Recipient Name,State,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
0,Woodbury County Iowa,IA,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,,,
1,Lexington-Fayette Urban County Kentucky,KY,Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,,,
2,Lexington-Fayette Urban County Kentucky,KY,Local Government,Cancelled,Non-Profit Capital Grants,6-Revenue Replacement,6.1-Provision of Government Services,the nonprofit capital project grants program i...,,,


In [44]:
# "us_covid_projects_df" is for (3) from Joanna's slack message:
us_covid_projects_df = all_us_projects_df[all_us_projects_df['Project Description'].str.contains('|'.join(search_term_list))]


# print(len(all_states_df['Project Description']))
print(f'The number of rows containing covid/vaccine search criteria terms is {len(us_covid_projects_df["Project Name"])}')
us_covid_projects_df.head()

The number of rows containing covid/vaccine search criteria terms is 8225


Unnamed: 0,Recipient Name,State,Recipient Type,Completion Status,Project Name,Expenditure Category Group,Expenditure Category,Project Description,Adopted Budget,Total Cumulative Obligations,Total Cumulative Expenditures
0,Woodbury County Iowa,IA,Local Government,Cancelled,LEC Main project,1-Public Health,1.4-Prevention in Congregate Settings (Nursing...,hvac to mitigate covid,,,
1,Lexington-Fayette Urban County Kentucky,KY,Local Government,Cancelled,Housing Stabilization - Salvation Army,6-Revenue Replacement,6.1-Provision of Government Services,financial assistance to salvation army to impr...,,,
5,State Of Idaho,ID,State/DC,Cancelled,Reserve for Covid 19 costs,1-Public Health,1.14-Other Public Health Services,additional unanticipated covid medical costs,,,
10,State Of Idaho,ID,State/DC,Cancelled,DHW Home visiting,2-Negative Economic Impacts,2.12-Healthy Childhood Environments: Home Visi...,•\tthe idaho department of health and welfare ...,,,
13,State Of Idaho,ID,State/DC,Cancelled,EMS Ambulance capacity,1-Public Health,1.10-COVID-19 Aid to Impacted Industries,•\tthe idaho legislature appropriated $2500000...,,,


In [45]:
#TODO: Collect and clean data for (1) from Joanna's slack message request



## Sarah Work Area

In [46]:
# import and read the state_summary.csv
# Load csv file(s)
state_summary_sheet = Path("Resources/state_summary.csv")


# Read csv file(s) as a DataFrame
state_summary_df = pd.read_csv(state_summary_sheet, skipinitialspace= True)

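# NOTE: the dtypes printed here are for Evan's state_spending_df, not state_summary_df;
# the state_summary_df dollar and percent columns are still strings (see the preview below).
print(f'The data types for state_spending_df are already float64 (nice!)\n\n{state_spending_df.dtypes}')
state_summary_df.head()

The data types for state_spending_df are already float64 (nice!)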

State/Territory                   object
Location                          object
Adopted Budget                   float64
Total Cumulative Obligations     float64
Total Cumulative Expenditures    float64
dtype: object


Unnamed: 0,State,Total state allocation (from the fed),total state plus total local federal grant,Total state spending,"Spent as of Sept 30, 2022",Total state obligated,Total state budgeted,Share of state allocation spent,Share of state allocation obligated,Share of state allocation budgeted,...,Share of local spent,Share of local obligated,Share of local budgeted,Share of state + local spent,Change in state spending since Sept (as share of total allocation),Change in local spending since Sept,Change in local government employment (inclusing public education) from Feb 2020 to Jan 2023,"Percentage change in local government employment, February 2020-Jan 2023","Change in state government jobs, Feb 2020 to Jan 2023 (thousands","Percentage change in state government jobs, Feb 2020 to Jan 2023"
0,Alabama,"$2,120,279,417","$3,287,582,722","$348,913,764","$340,112,472","$350,199,320","$1,060,139,709",16.5%,16.5%,50.0%,...,20.5%,35.0%,23.6%,18%,0.42%,4.1%,0.4,0.18%,0.3,0.25%
1,Alaska,"$1,011,788,220","$1,166,360,017","$865,562,003","$805,280,930","$884,653,257","$1,001,201,989",85.5%,87.4%,99.0%,...,62.5%,70.4%,78.0%,82%,5.96%,31.2%,-1.7,-4.10%,-0.5,-2.20%
2,Arizona,"$4,182,827,492","$6,621,288,758","$2,120,555,074","$1,923,020,697","$2,496,788,343","$2,792,726,506",50.7%,59.7%,66.8%,...,30.6%,43.3%,76.1%,43%,4.72%,3.3%,-12.0,-4.34%,0.0,0.00%
3,Arkansas,"$1,573,121,581","$2,112,900,112","$616,773,435","$546,907,964","$660,527,986","$767,344,936",39.2%,42.0%,48.8%,...,32.4%,49.3%,30.9%,37%,4.44%,8.1%,-2.9,-2.53%,-2.1,-2.68%
4,California,"$27,017,016,860","$41,419,307,889","$20,188,839,813","$19,629,506,051","$24,826,648,677","$26,933,816,205",74.7%,91.9%,99.7%,...,37.1%,46.4%,67.2%,62%,2.07%,4.3%,-60.2,-3.29%,7.4,1.37%


In [None]:
# create a reduced dataframe from the state_summary_df columns: 
    #'State', 'Total state allocation (from the fed)', 'total state plus total local federal grant', 
    #'Share of state allocation spent', 'Share of state allocation obligated', 'Share of state allocation budgeted', 
    #'Total local allocation (from the fed)', 'Share of local spent', 'Share of local obligated', 'Share of local budgeted', 
    #'Share of state + local spent'
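A possible starting point for this cell, offered as a sketch rather than the final code: select the wanted columns, then strip the `$`, `,` and `%` characters so the values become numeric. The sample frame below copies two rows and three columns from the preview above; `to_number` and `keep_cols` are hypothetical names, and the real cell would list all of the columns named in the comment.

```python
import pandas as pd

# Two rows copied from the state_summary_df preview; only a few columns for brevity.
summary_sample = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Total state allocation (from the fed)": ["$2,120,279,417", "$1,011,788,220"],
    "Share of state allocation spent": ["16.5%", "85.5%"],
    "Total state spending": ["$348,913,764", "$865,562,003"],  # not in the reduced list
})

keep_cols = ["State", "Total state allocation (from the fed)", "Share of state allocation spent"]
reduced_df = summary_sample[keep_cols].copy()


def to_number(series):
    """Strip '$', ',' and '%' from a string column and convert it to numeric (hypothetical helper)."""
    return pd.to_numeric(series.str.replace(r"[$,%]", "", regex=True))


# Convert every column except the state name.
for col in keep_cols[1:]:
    reduced_df[col] = to_number(reduced_df[col])

print(reduced_df.dtypes)
```

Percent columns end up as plain numbers (16.5 means 16.5%), which is the form the chart step would need.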





In [None]:
# merge this data frame with Evan's "state_spending_df". Merge on the state columns.
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
    # https://www.geeksforgeeks.org/how-to-join-pandas-dataframes-using-merge/#
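A sketch of what the merge might look like, assuming the join keys are `State/Territory` in Evan's frame and `State` in the summary frame. The two small frames are stand-ins built from values in the previews above; `indicator=True` is used here only to surface names that fail to match (territories, for example), which seems worth checking before dropping anything.

```python
import pandas as pd

# Stand-in for Evan's state_spending_df (values copied from the preview above).
spending_sample = pd.DataFrame({
    "State/Territory": ["Alabama", "Alaska", "American Samoa"],
    "Adopted Budget": [341856900.0, 89922310.0, 447866300.0],
})
# Stand-in for the reduced state summary, keyed on 'State'.
summary_sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Share of state allocation spent": [16.5, 85.5, 50.7],
})

# An outer merge keeps unmatched rows from both sides; '_merge' records where each row came from.
merged_df = spending_sample.merge(
    summary_sample,
    how="outer",
    left_on="State/Territory",
    right_on="State",
    indicator=True,
)

# Rows present in only one of the two frames.
unmatched = merged_df[merged_df["_merge"] != "both"]
print(unmatched[["State/Territory", "State", "_merge"]])
```

Once the unmatched names are understood, switching to `how="inner"` would keep only the states present in both frames.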



## Aaliyah Work Area

In [None]:
# Using Sarah's combined dataframe, generate a combined bar/line chart
# x-axis will contain state names
# left-side y-axis and bar chart data will show % state funding used.
# right-side y-axis and line chart data will show 'total state plus total local federal grant' dollar amounts

    #source methods: https://towardsdatascience.com/creating-a-dual-axis-combo-chart-in-python-
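A minimal sketch of the planned combo chart, assuming matplotlib's `twinx()` for the second y-axis and using three rows copied from the state summary preview in place of Sarah's combined dataframe (which does not exist yet). Colors, labels and figure size are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; remove this line in the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the state_summary_df preview above.
chart_df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Share of state allocation spent": [16.5, 85.5, 50.7],
    "total state plus total local federal grant": [3287582722, 1166360017, 6621288758],
})

# Left axis: bars for the share of the state allocation spent.
fig, ax_left = plt.subplots(figsize=(8, 4))
ax_left.bar(chart_df["State"], chart_df["Share of state allocation spent"], color="steelblue")
ax_left.set_ylabel("% of state allocation spent")
ax_left.tick_params(axis="x", rotation=90)

# Right axis: twinx() creates a second y-axis that shares the same x-axis.
ax_right = ax_left.twinx()
ax_right.plot(chart_df["State"], chart_df["total state plus total local federal grant"],
              color="darkorange", marker="o")
ax_right.set_ylabel("State + local federal grant ($)")

fig.tight_layout()
```

With all states on the x-axis, sorting the dataframe by one of the two measures before plotting may make the chart easier to read.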




## Data Exploration and Cleanup:
- Describe here the group's data sets and how they were cleaned for analysis

## Greg Work Area

### CDC Data

In [None]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
import scipy.stats as st
import seaborn as sns

In [None]:
#Import vaccination data from csv
vac_df = pd.read_csv('COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv')

In [None]:
#function formats the CDC dataframe for US jurisdictions - see below for input formats
def CDC_format(df,key,date,add_str,drop_values): #key, date, add_str -> str; drop_values -> list
    df = df.dropna(subset=[key])
    df = df[df['Date'] == date] #filter on the dataframe passed in, not the global vac_df
    df = df[~df[key].isin(drop_values)].copy()
    df[key] = add_str + df[key].astype(str)
    df.reset_index(drop = True, inplace = True)
    return df

In [None]:
drop = ['DD2','FM','AS','VI','BP2','IH2','GU','PN','PR','VA2','PW','US','MP','MH']
vac_df = CDC_format(vac_df,'Location','12/28/2022','US_',drop)

In [None]:
vac_df

In [None]:
### Google vac data

In [None]:
#Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
import scipy.stats as st
import seaborn as sns
import cartopy.crs as ccrs
import geoviews as gv # noqa
import pyproj
import geopandas as gpd
import hvplot.pandas
import plotly.express as px

In [None]:
#Import vaccination data from google api
vac_df = pd.read_csv('https://storage.googleapis.com/covid19-open-data/v3/latest/vaccinations.csv')

In [None]:
#function formats the google dataframes - see below for input formats
def google_format(df,key,filt,length,columns,drop_values): #key, filt -> str; length -> int; columns, drop_values -> list
    df = df.dropna(subset=[key])
    df = df[df[key].str.contains(filt)]
    mask = (df[key].str.len() == length)
    df = df.loc[mask]
    df = df[columns]
    df = df[~df[key].isin(drop_values)]
    df.reset_index(drop = True, inplace = True)
    return df

In [None]:
#Input values for vaccination data
drop = ['US_AS','US_GU','US_MP','US_PR','US_VI']
cols = ['date','location_key','cumulative_persons_fully_vaccinated','new_persons_vaccinated','new_persons_fully_vaccinated']
loc_key = 'location_key'
contains = 'US_'

In [None]:
#formatting vaccination data
vac_df = google_format(vac_df, loc_key, contains, 5, cols, drop)

In [None]:
mylist = ['Orange','Apple'] #Keywords search
pattern = '|'.join(mylist)
vac_df.location_key.str.contains(pattern)

In [None]:
#reading demographic data
dem_df = pd.read_csv('demographics.csv')

In [None]:
dem_df

In [None]:
dcols = ['location_key','population']

In [None]:
#formatting demographic data
dem_df = google_format(dem_df, loc_key, contains, 5, dcols, drop)

In [None]:
#reading epidemiology data
epi_df = pd.read_csv('https://storage.googleapis.com/covid19-open-data/v3/latest/epidemiology.csv')

In [None]:
ecols = ['location_key','cumulative_confirmed','cumulative_deceased','cumulative_recovered']

In [None]:
#formatting epidemiology data
epi_df = google_format(epi_df, loc_key, contains, 5, ecols, drop)

In [None]:
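#named us_locations_df so it does not overwrite the loc_key string used by google_format above
us_locations_df = pd.read_csv('https://storage.googleapis.com/covid19-open-data/v3/location/US.csv')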

In [None]:
#US_vac_df was never defined; use the formatted vac_df from above
AK_vac_df = vac_df[vac_df['location_key'].str.contains('US_AK')]

In [None]:
#Looking at only one state - this can be skipped
AK_total = AK_vac_df['cumulative_persons_fully_vaccinated'].iloc[1:len(AK_vac_df)].sum()
AK_total

In [None]:
#we don't need this at the moment, can be skipped
def swap_rows(df, i1, i2): #Keep this!!!
    a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy()
    df.iloc[i1, :], df.iloc[i2, :] = b, a
    return df

In [None]:
#merging dataframes
total_df = vac_df.merge(dem_df, how = 'inner',on = 'location_key')

In [None]:

total_df['percent_fully_vaccinated'] = (total_df['cumulative_persons_fully_vaccinated']/total_df['population'])*100
total_df.sort_values('percent_fully_vaccinated', ascending = False)

In [None]:
#merging dataframes
total_df = total_df.merge(epi_df, how = 'inner',on = 'location_key')

In [None]:
total_df['percent_death_rate_by_case'] = (total_df['cumulative_deceased']/total_df['cumulative_confirmed'])*100

In [None]:
total_df['percent_death_rate_per_capita'] = (total_df['cumulative_deceased']/total_df['population'])*100

In [None]:
total_df['percent_confirmed'] = (total_df['cumulative_confirmed']/total_df['population'])*100

In [None]:
total_df['state_code'] = total_df.location_key.str.replace('US_','') #adding the state code for the plotly function

In [None]:
total_df.sort_values('percent_fully_vaccinated', ascending = False)

In [None]:
#function for regression plots
def reg(df,x,y,x_text,y_text):    
    lm = st.linregress(x = df[x], y = df[y])
    data_fit = lm[0]*df[x] + lm[1]
    fit_df = pd.DataFrame({'x': df[x], 'fitted': data_fit})
    ax = sns.scatterplot(data = df, x = x, y = y)
    #ax = df.plot.scatter(y = y, x = x, s = 30)
    print(f"The r-value is: {lm[2]}")
    fit_df.plot.line(x = 'x', y = 'fitted', color = 'red', ax=ax, legend = None, xlabel = x)
    plt.text(x_text,y_text,f"y = {'%.2f' %lm[0]}x + {'%.1f' %lm[1]}", color = 'red', fontsize = 16)

In [None]:
reg(total_df,'percent_fully_vaccinated','percent_death_rate_by_case',50,0.6)

In [None]:
reg(total_df,'percent_fully_vaccinated','percent_death_rate_per_capita',50,0.15)

In [None]:
reg(total_df,'percent_fully_vaccinated','percent_confirmed',50,20)

In [None]:
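#NOTE: geopandas.datasets was deprecated in geopandas 0.14 and removed in 1.0, so this cell needs an older geopandas
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))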

world.hvplot(c='country', geo=True)

In [None]:
#generating map of us states - you need to specify the color variable as one of the dataframe columns 
fig = px.choropleth(total_df,
                    locations='state_code', 
                    locationmode="USA-states", 
                    scope="usa",
                    color='percent_death_rate_per_capita',
                    color_continuous_scale="blues" 
                    )
# fig.add_scattergeo(
#     locations=total_df['state_code'],
#     locationmode="USA-states", 
#     text=total_df['state_code'],
#     mode='text',
# )
fig.show()

## Joanna Work Area

In [None]:
#putting Greg's code down here so I can run my area independently of the rest of the sheet without error
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from scipy.stats import linregress
import scipy.stats as st
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [None]:
#Import vaccination data from csv
vac_df = pd.read_csv('Resources/COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv')

In [None]:
#function formats the CDC dataframe for US jurisdictions - see below for input formats
def CDC_format(df,key,date,add_str,drop_values): #key, date, add_str -> str; drop_values -> list
    df = df.dropna(subset=[key])
    df = df[df['Date'] == date] #filter on the dataframe passed in, not the global vac_df
    df = df[~df[key].isin(drop_values)].copy()
    df[key] = add_str + df[key].astype(str)
    df.reset_index(drop = True, inplace = True)
    return df

In [None]:
#drop non-state territories from dataframe, select only rows with 12/28/22 data
drop = ['DD2','FM','AS','VI','BP2','IH2','GU','PN','PR','VA2','PW','US','MP','MH']
vac_df = CDC_format(vac_df,'Location','12/28/2022','US_',drop)

In [None]:
#change location to match state code
vac_df['Location'] = vac_df['Location'].str.replace('US_', '')

## To do list
Work out the population figure the CDC is using for each state, and use it to calculate a Pop_Pct for the Administered_Bivalent column

Compare Administered to Recip_Administered to see if there are any significant differences in any state

Make some smaller dataframes for viewing:

a) Whole pop with Distrib, Administered, Dose1, Series Complete, Additional Doses, Second Booster, Administered Bivalent

b) Each individual age group with Dose1, Series Complete, Additional Doses, Second Booster, Bivalent Booster

c) Each category (Dose1, Series Complete, Additional Doses, Second Booster, Bivalent Booster) with all age ranges

Identify which states deviate most from the mean (general/nationwide population) in % vaccinated (looking at all dosage categories and age categories). This will show us which states were the "good vaccinators" and which were the "poor vaccinators." We can then use the EARN data to see whether this correlates with how much of the federal money they spent, how many vaccination projects they ran, etc.
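Two of the items above reduce to simple arithmetic, sketched here on made-up numbers (three hypothetical jurisdictions, not CDC values). A count and its published percentage imply the population denominator, which can then be reused for a column that has no Pop_Pct, and each state's distance from the mean can be expressed as a z-score. The new column names are placeholders. The published percentages are rounded, so the implied population is only approximate, and an unweighted mean across states is not the same as the nationwide rate.

```python
import pandas as pd

# Hypothetical counts and percentages for three made-up jurisdictions (not CDC values).
toy_df = pd.DataFrame({
    "Location": ["AA", "BB", "CC"],
    "Series_Complete_Yes": [600_000, 1_400_000, 450_000],
    "Series_Complete_Pop_Pct": [60.0, 70.0, 50.0],
    "Administered_Bivalent": [150_000, 400_000, 90_000],
})

# A count and its percentage imply the denominator: population = count / (pct / 100).
toy_df["implied_population"] = (
    toy_df["Series_Complete_Yes"] / (toy_df["Series_Complete_Pop_Pct"] / 100)
)

# Reuse that denominator for the column that has no Pop_Pct of its own.
toy_df["Administered_Bivalent_Pop_Pct"] = (
    toy_df["Administered_Bivalent"] / toy_df["implied_population"] * 100
)

# Distance from the (unweighted) mean across jurisdictions, in standard deviations.
pct = toy_df["Series_Complete_Pop_Pct"]
toy_df["series_complete_z"] = (pct - pct.mean()) / pct.std()

print(toy_df[["Location", "implied_population",
              "Administered_Bivalent_Pop_Pct", "series_complete_z"]])
```

States with large positive or negative z-scores would be the candidates for the "good" and "poor" vaccinator groups; the cutoff is a judgment call for the team.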


In [None]:
# get all the columns we will be interested in into one dataframe
# NOTE: there is no Pop_Pct column for the administered_bivalent, and second_booster only for the age breakouts
# but we can extrapolate from their other population calculations to calculate these. For second_booster to get state numbers
# we have to add up the vaccines from the different manufacturers because we don't have them already summed.

vac_df = pd.DataFrame(data=vac_df, columns=["Location", "Distributed", "Administered", "Recip_Administered",
                                                   "Administered_Dose1_Recip", "Administered_Dose1_Pop_Pct",
                                                   "Administered_Dose1_Recip_5Plus", "Administered_Dose1_Recip_5PlusPop_Pct",
                                                  "Administered_Dose1_Recip_12Plus", "Administered_Dose1_Recip_12PlusPop_Pct",
                                                  "Administered_Dose1_Recip_18Plus", "Administered_Dose1_Recip_18PlusPop_Pct",
                                                  "Administered_Dose1_Recip_65Plus", "Administered_Dose1_Recip_65PlusPop_Pct",
                                                  "Series_Complete_Yes", "Series_Complete_Pop_Pct", "Series_Complete_5Plus",
                                                  "Series_Complete_12Plus", "Series_Complete_12PlusPop_Pct",
                                                   "Series_Complete_18Plus", "Series_Complete_18PlusPop_Pct",
                                                   "Series_Complete_65Plus", "Series_Complete_65PlusPop_Pct", "Additional_Doses",
                                                   "Additional_Doses_Vax_Pct", "Additional_Doses_5Plus",
                                                   "Additional_Doses_5Plus_Vax_Pct", "Additional_Doses_12Plus",
                                                   "Additional_Doses_12Plus_Vax_Pct", "Additional_Doses_18Plus",
                                                   "Additional_Doses_18Plus_Vax_Pct", "Additional_Doses_50Plus",
                                                   "Additional_Doses_50Plus_Vax_Pct", "Additional_Doses_65Plus",
                                                   "Additional_Doses_65Plus_Vax_Pct", "Second_Booster_50Plus",
                                                   "Second_Booster_50Plus_Vax_Pct", "Second_Booster_65Plus",
                                                   "Second_Booster_65Plus_Vax_Pct", "Second_Booster_Janssen",
                                                    "Second_Booster_Moderna", "Second_Booster_Pfizer",
                                                    "Second_Booster_Unk_Manuf", "Administered_Bivalent",
                                                   "Bivalent_Booster_5Plus", "Bivalent_Booster_5Plus_Pop_Pct",
                                                   "Bivalent_Booster_12Plus", "Bivalent_Booster_12Plus_Pop_Pct",
                                                   "Bivalent_Booster_18Plus", "Bivalent_Booster_18Plus_Pop_Pct"])

In [None]:
# remove commas from numeric columns
# convert numeric columns to correct type
vac_df = vac_df.replace(',','', regex=True)
numeric_cols = ["Distributed", "Administered", "Recip_Administered", "Administered_Dose1_Recip", "Administered_Dose1_Pop_Pct",
                "Administered_Dose1_Recip_5Plus", "Administered_Dose1_Recip_5PlusPop_Pct", "Administered_Dose1_Recip_12Plus",
                "Administered_Dose1_Recip_12PlusPop_Pct", "Administered_Dose1_Recip_18Plus",
                "Administered_Dose1_Recip_18PlusPop_Pct", "Administered_Dose1_Recip_65Plus",
                "Administered_Dose1_Recip_65PlusPop_Pct", "Series_Complete_Yes", "Series_Complete_Pop_Pct",
                "Series_Complete_5Plus", "Series_Complete_12Plus", "Series_Complete_12PlusPop_Pct", "Series_Complete_18Plus",
                "Series_Complete_18PlusPop_Pct", "Series_Complete_65Plus", "Series_Complete_65PlusPop_Pct", "Additional_Doses",
                "Additional_Doses_Vax_Pct", "Additional_Doses_5Plus", "Additional_Doses_5Plus_Vax_Pct", "Additional_Doses_12Plus",
                "Additional_Doses_12Plus_Vax_Pct", "Additional_Doses_18Plus", "Additional_Doses_18Plus_Vax_Pct",
                "Additional_Doses_50Plus", "Additional_Doses_50Plus_Vax_Pct", "Additional_Doses_65Plus",
                "Additional_Doses_65Plus_Vax_Pct", "Second_Booster_50Plus", "Second_Booster_50Plus_Vax_Pct",
                "Second_Booster_65Plus", "Second_Booster_65Plus_Vax_Pct", "Second_Booster_Janssen",
                "Second_Booster_Moderna", "Second_Booster_Pfizer", "Second_Booster_Unk_Manuf", "Administered_Bivalent",
                "Bivalent_Booster_5Plus", "Bivalent_Booster_5Plus_Pop_Pct", "Bivalent_Booster_12Plus",
                "Bivalent_Booster_12Plus_Pop_Pct", "Bivalent_Booster_18Plus", "Bivalent_Booster_18Plus_Pop_Pct"]
vac_df[numeric_cols] = vac_df[numeric_cols].apply(pd.to_numeric)
vac_df
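
The cell above strips commas from the entire frame, which also touches any text column that happens to contain a comma. A narrower variant, sketched here on made-up toy data, cleans only the columns being converted and turns unparseable entries into NaN instead of raising. The `errors="coerce"` choice is an assumption about how bad values should be treated, not something the notebook specifies.

```python
import pandas as pd

# Toy frame (made-up values) imitating string columns with thousands separators.
raw_df = pd.DataFrame({
    "Location": ["AA", "BB", "CC"],
    "Distributed": ["1,200,000", "850,500", "N/A"],
    "Series_Complete_Pop_Pct": ["67.5", "72.1", "80"],
})

numeric_cols = ["Distributed", "Series_Complete_Pop_Pct"]

clean_df = raw_df.copy()
for col in numeric_cols:
    # Strip commas only in the columns being converted, then coerce bad values to NaN.
    stripped = clean_df[col].astype(str).str.replace(",", "", regex=False)
    clean_df[col] = pd.to_numeric(stripped, errors="coerce")

print(clean_df.dtypes)
print(clean_df)
```

Counting the NaN values per column afterwards (`clean_df[numeric_cols].isna().sum()`) shows how many entries failed to parse.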

In [None]:
# calculate totals for second booster
vac_df["Second_Booster_Total"] = (vac_df["Second_Booster_Janssen"] + vac_df["Second_Booster_Moderna"]
                                + vac_df["Second_Booster_Pfizer"] + vac_df["Second_Booster_Unk_Manuf"])
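# back-calculate each state's population from a count and its matching percentage
# NOTE: the two estimates (Pop1 and Pop2) do not agree with each other; cause still to be determined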
vac_df["Pop1"] = vac_df["Series_Complete_Yes"] / (vac_df["Series_Complete_Pop_Pct"]/100)
vac_df["Pop2"] = vac_df["Administered_Dose1_Recip"] / (vac_df["Administered_Dose1_Pop_Pct"]/100)

vac_pops_df = pd.DataFrame(data=vac_df, columns=["Location", "Pop1", "Pop2"])
vac_pops_df
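
One possible reason the two population estimates disagree is that a reported percentage may be capped: the summary statistics later in this notebook show several percentage columns topping out at exactly 95.0. If that is a cap, dividing a count by a capped percentage inflates the implied population. The sketch below uses made-up toy numbers and an assumed cap of 95 to show the effect and to flag affected rows. It is a hypothesis to check, not a confirmed diagnosis.

```python
import pandas as pd

# Toy data (made-up): the true population is 1,000,000 in both rows.
toy_df = pd.DataFrame({
    "Location": ["AA", "BB"],
    "Series_Complete_Yes": [600_000, 800_000],
    "Series_Complete_Pop_Pct": [60.0, 80.0],
    "Administered_Dose1_Recip": [700_000, 990_000],
    "Administered_Dose1_Pop_Pct": [70.0, 95.0],  # BB: true 99% reported as 95%
})

CAP = 95.0  # assumed cap on reported percentages

toy_df["Pop1"] = toy_df["Series_Complete_Yes"] / (toy_df["Series_Complete_Pop_Pct"] / 100)
toy_df["Pop2"] = toy_df["Administered_Dose1_Recip"] / (toy_df["Administered_Dose1_Pop_Pct"] / 100)

# A capped percentage understates coverage, so the implied population is too large.
toy_df["Pop2_capped"] = toy_df["Administered_Dose1_Pop_Pct"] >= CAP
toy_df["Pop_rel_diff"] = (toy_df["Pop2"] - toy_df["Pop1"]).abs() / toy_df["Pop1"]
print(toy_df[["Location", "Pop1", "Pop2", "Pop2_capped", "Pop_rel_diff"]])
```

Rows where neither percentage is at the cap should agree to within rounding error. Large gaps in uncapped rows would point to a different cause.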


In [None]:
# df with vax data for all ages
vac_all_ages_df = pd.DataFrame(data=vac_df, columns=["Location", "Distributed", "Administered", "Recip_Administered",
                                                   "Administered_Dose1_Recip", "Administered_Dose1_Pop_Pct",
                                                    "Series_Complete_Yes", "Series_Complete_Pop_Pct",
                                                     "Additional_Doses", "Additional_Doses_Vax_Pct", "Administered_Bivalent"])

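# add 'Dose Differential' column: doses administered in the state minus doses received by the state's residents.
# Positive = net doses given to nonresidents; negative = residents received more doses than were given in-state.
# The two 'Pct' columns below hold absolute fractions (0-1), not percentages.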
vac_all_ages_df["Dose Differential"] = vac_all_ages_df["Administered"] - vac_all_ages_df["Recip_Administered"]
vac_all_ages_df["Dose Diff. as Pct of Doses Given"] = abs(vac_all_ages_df["Dose Differential"] / vac_all_ages_df["Administered"])
vac_all_ages_df["Dose Diff. as Pct of Residents Vaxxed"] = abs(vac_all_ages_df["Dose Differential"] / vac_all_ages_df["Recip_Administered"])
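
Because the cell above takes `abs()`, the direction of the flow is lost in the percentage columns. A signed variant is sketched below on made-up toy data. It assumes `Administered` counts doses by where they were given and `Recip_Administered` counts doses by where the recipient lives. The `Direction` labels and the `Signed Pct of Doses Given` column name are hypothetical.

```python
import pandas as pd

# Toy data (made-up values); column names mirror the CDC file.
toy_df = pd.DataFrame({
    "Location": ["AA", "BB"],
    "Administered": [1_000, 900],        # doses given inside the state
    "Recip_Administered": [900, 1_000],  # doses received by the state's residents
})

toy_df["Dose Differential"] = toy_df["Administered"] - toy_df["Recip_Administered"]
# Signed share of in-state doses, scaled to a true percentage.
toy_df["Signed Pct of Doses Given"] = 100 * toy_df["Dose Differential"] / toy_df["Administered"]
toy_df["Direction"] = toy_df["Dose Differential"].apply(
    lambda d: "net doses to nonresidents" if d > 0
    else "net doses received elsewhere" if d < 0
    else "balanced"
)
print(toy_df)
```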


In [None]:
vac_dd_df = pd.DataFrame(data=vac_all_ages_df, columns=["Location", "Distributed", "Administered", "Recip_Administered", "Dose Differential",
                         "Dose Diff. as Pct of Doses Given", "Dose Diff. as Pct of Residents Vaxxed", "Administered_Dose1_Pop_Pct", "Series_Complete_Pop_Pct",
                         "Additional_Doses_Vax_Pct"])
vac_dd_df




In [None]:
# Checking on second booster columns -- these NaN values actually exist in the spreadsheet. Is there something going on with
# the function that was used to create the initial dataframe?
vac_secondbooster_df = pd.DataFrame(data=vac_df, columns=["Location", "Second_Booster_50Plus",
                                                   "Second_Booster_50Plus_Vax_Pct", "Second_Booster_65Plus",
                                                   "Second_Booster_65Plus_Vax_Pct", "Second_Booster_Janssen",
                                                    "Second_Booster_Moderna", "Second_Booster_Pfizer",
                                                    "Second_Booster_Unk_Manuf"])


In [None]:
# find how many doses were distributed vs administered
# calculate percent
# sort alphabetically by state
vac_waste_df = pd.DataFrame(data=vac_df, columns=["Location", "Distributed", "Administered"])
vac_waste_df["Pct. Administered"] = vac_waste_df["Administered"] / vac_waste_df["Distributed"]
vac_waste_df.sort_values('Location')

In [None]:
# Show the 10 states that administered the largest share of the doses distributed to them
vac_waste_best_df = vac_waste_df.sort_values('Pct. Administered', ascending=False)
vac_waste_best_df.head(10)

In [None]:
# Show the 10 states that administered the smallest share of the doses distributed to them
vac_waste_worst_df = vac_waste_df.sort_values('Pct. Administered', ascending=True)
vac_waste_worst_df.head(10)

## Kendal Work Area

In [47]:
#putting Greg's code down here so I can run my area independently of the rest of the sheet without error
#Import libraries
import pandas as pd
import matplotlib.pyplot as plt 
import plotly.express as px
import plotly.graph_objects as go

In [48]:
#Import vaccination data from csv
vac_df = pd.read_csv('Resources/COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv', low_memory=False)



In [49]:
#function formats the CDC dataframe for US jurisdictions
#inputs: df -> DataFrame; key, date, add_str -> str; drop_values -> list
def CDC_format(df,key,date,add_str,drop_values):
    df = df.dropna(subset=[key])                #drop rows with no value in the key column
    df = df[df['Date'] == date]                 #keep only rows for the requested date
    df = df[~df[key].isin(drop_values)].copy()  #remove unwanted jurisdictions
    df[key] = add_str + df[key].astype(str)     #prefix the key values
    df.reset_index(drop = True, inplace = True)
    return df

In [50]:
#drop territories, federal entities, and the US total so that only the 50 states and DC remain
drop = ['DD2','FM','AS','VI','BP2','IH2','GU','PN','PR','VA2','PW','US','MP','MH']
vac_df = CDC_format(vac_df,'Location','12/28/2022','US_',drop)

In [51]:
#change location to match state code for choropleth maps
vac_df['Location'] = vac_df['Location'].str.replace('US_', '')

In [52]:
#create df with only columns related to choropleth maps
choropleth_vac_df = vac_df[['Location', 
                            'Distributed', 
                            'Administered', 
                            'Administered_Dose1_Pop_Pct', 
                            'Series_Complete_Pop_Pct', 
                            'Series_Complete_5PlusPop_Pct', 
                            'Series_Complete_12PlusPop_Pct', 
                            'Series_Complete_18PlusPop_Pct', 
                            'Series_Complete_65PlusPop_Pct', 
                            'Additional_Doses_Vax_Pct', 
                            'Additional_Doses_65Plus_Vax_Pct', 
                            'Second_Booster_65Plus_Vax_Pct', 
                            'Bivalent_Booster_5Plus_Pop_Pct', 
                            'Bivalent_Booster_12Plus_Pop_Pct', 
                            'Bivalent_Booster_18Plus_Pop_Pct', 
                            'Bivalent_Booster_65Plus_Pop_Pct']]

In [53]:
fig_complete_total_pop = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Series_Complete_Pop_Pct',
                    labels={'Series_Complete_Pop_Pct':'% of Population Fully Vaccinated'},
                    color_continuous_scale="aggrnyl_r",
                    title='Vaccination Status by State - Fully Vaccinated'
                    )
fig_complete_total_pop

In [54]:
fig_complete_5plus = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Series_Complete_5PlusPop_Pct',
                    range_color=(30,95),
                    labels={'Series_Complete_5PlusPop_Pct':'% of 5+ Population Fully Vaccinated'},
                    color_continuous_scale="aggrnyl",
                    title='Vaccination Status by State - Fully Vaccinated (5+)'
                    )
fig_complete_5plus

In [55]:
fig_complete_12plus = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Series_Complete_12PlusPop_Pct',
                    labels={'Series_Complete_12PlusPop_Pct':'% of 12+ Population Fully Vaccinated'},
                    color_continuous_scale="aggrnyl",
                    title='Vaccination Status by State - Fully Vaccinated (12+)'
                    )
fig_complete_12plus

In [56]:
fig_complete_18plus = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Series_Complete_18PlusPop_Pct',
                    labels={'Series_Complete_18PlusPop_Pct':'% of 18+ Population Fully Vaccinated'},
                    color_continuous_scale="aggrnyl_r",
                    title='Vaccination Status by State - Fully Vaccinated (18+)'
                    )
fig_complete_18plus

In [57]:
fig_at_least_1 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Administered_Dose1_Pop_Pct',
                    labels={'Administered_Dose1_Pop_Pct':'% of Population Partially or Fully Vaccinated'},
                    color_continuous_scale="blues",
                    title='Vaccination Status by State - Partially or Fully Vaccinated'
                    )
fig_at_least_1

In [58]:
fig_complete_65_plus = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Series_Complete_65PlusPop_Pct',
                    range_color=(30,95),
                    labels={'Series_Complete_65PlusPop_Pct':'% of 65+ Population Fully Vaccinated'},
                    color_continuous_scale="greens", 
                    title='Vaccination Status by State & Age - Fully Vaccinated (65+)'
                    )
fig_complete_65_plus

In [59]:
fig_bivalent_booster_65 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Bivalent_Booster_65Plus_Pop_Pct',
                    labels={'Bivalent_Booster_65Plus_Pop_Pct':'% of 65+ Population with Bivalent Booster'},
                    color_continuous_scale="reds",  
                    title='Bivalent Booster Status by State - (65+)'
                    )
fig_bivalent_booster_65

In [60]:
fig_bivalent_booster_5 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Bivalent_Booster_5Plus_Pop_Pct',
                    labels={'Bivalent_Booster_5Plus_Pop_Pct':'% of 5+ Population with Bivalent Booster'},
                    color_continuous_scale="oranges",  
                    title='Bivalent Booster Status by State - (5+)'
                    )
fig_bivalent_booster_5

In [62]:
fig_bivalent_booster_12 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Bivalent_Booster_12Plus_Pop_Pct',
                    labels={'Bivalent_Booster_12Plus_Pop_Pct':'% of 12+ Population with Bivalent Booster'},
                    color_continuous_scale="purples",  
                    title='Bivalent Booster Status by State - (12+)'
                    )
fig_bivalent_booster_12

In [63]:
fig_bivalent_booster_18 = px.choropleth(vac_df,
                    locations='Location',
                    locationmode="USA-states",
                    scope="usa",
                    color='Bivalent_Booster_18Plus_Pop_Pct',
                    labels={'Bivalent_Booster_18Plus_Pop_Pct':'% of 18+ Population with Bivalent Booster'},
                    color_continuous_scale="magma_r",  
                    title='Bivalent Booster Status by State - (18+)'
                    )
fig_bivalent_booster_18

In [64]:
#boxplots showing spread of data across all 50 states and DC for selected columns
boxplot = vac_df.boxplot(column=['Series_Complete_Pop_Pct', 
                                 'Administered_Dose1_Pop_Pct', 
                                 'Series_Complete_65PlusPop_Pct', 
                                 'Bivalent_Booster_65Plus_Pop_Pct'], 
                         rot=45,
                         grid=True,
                         figsize = (15,10),
                        )
plt.title("Distribution of Vaccination Rates Across U.S. States")
plt.xticks([1, 2, 3, 4], ['% Pop. Fully Vaccinated', '% Pop. Partially or Fully Vaccinated', '% Pop. Fully Vaccinated - 65+', '% Pop. Bivalent Booster - 65+'])
plt.show()

In [65]:
#summary statistics for selected columns (across all 50 states and DC) 
choropleth_vac_df.describe()

Statistic,Administered_Dose1_Pop_Pct,Series_Complete_Pop_Pct,Series_Complete_5PlusPop_Pct,Series_Complete_12PlusPop_Pct,Series_Complete_18PlusPop_Pct,Series_Complete_65PlusPop_Pct,Additional_Doses_Vax_Pct,Additional_Doses_65Plus_Vax_Pct,Second_Booster_65Plus_Vax_Pct,Bivalent_Booster_5Plus_Pop_Pct,Bivalent_Booster_12Plus_Pop_Pct,Bivalent_Booster_18Plus_Pop_Pct,Bivalent_Booster_65Plus_Pop_Pct
count,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0
mean,78.680392,67.827451,71.77451,75.852941,77.496078,92.098039,50.894118,74.141176,58.372549,16.107843,17.354902,18.396078,40.094118
std,11.232934,9.773374,9.994836,9.591066,9.058652,3.429956,6.92553,7.081615,7.967762,6.004493,6.277462,6.447603,10.75562
min,60.7,52.9,56.2,60.6,63.1,83.8,31.6,45.9,36.5,5.6,6.2,6.7,18.4
25%,69.1,59.5,62.9,67.15,69.2,89.15,47.0,70.25,53.5,11.15,12.15,13.15,32.35
50%,77.1,66.0,70.6,75.0,77.1,94.1,49.5,73.6,59.1,15.1,16.3,17.3,40.2
75%,90.35,74.65,78.95,83.15,84.15,95.0,56.45,80.05,63.5,20.1,21.65,22.8,48.15
max,95.0,87.4,91.7,94.5,95.0,95.0,66.1,86.1,73.3,30.8,32.3,33.4,62.8


In [97]:
#share of all projects nationwide that fall in each expenditure category group
spending_overall = all_us_projects_df.groupby('Expenditure Category Group').count()
total_projects = spending_overall['State'].sum()
categories_percentage_overall = (spending_overall['State']/total_projects)*100
categories_percentage_overall = pd.DataFrame(categories_percentage_overall)
categories_percentage_overall.rename(columns={'State': '% of Total Projects'}, inplace=True)
categories_percentage_overall

Expenditure Category Group,% of Total Projects
1-Public Health,19.943828
2-Negative Economic Impacts,26.25474
3-Public Health-Negative Economic Impact: Public Sector Capacity,6.347423
4-Premium Pay,2.11768
5-Infrastructure,13.798624
6-Revenue Replacement,26.535599
7-Administrative,5.002106
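
The same percentages can be obtained more directly with `value_counts(normalize=True)`, which counts rows per category instead of relying on the non-null count of one particular column. The sketch below runs on a made-up toy project list. Only the column names are taken from the EARN workbook.

```python
import pandas as pd

# Toy project list (made-up rows); column names mirror the EARN workbook.
toy_projects = pd.DataFrame({
    "State": ["AK", "AK", "WY", "WY"],
    "Expenditure Category Group": [
        "1-Public Health", "6-Revenue Replacement",
        "6-Revenue Replacement", "6-Revenue Replacement",
    ],
})

pct_by_category = (
    toy_projects["Expenditure Category Group"]
    .value_counts(normalize=True)   # share of rows per category
    .mul(100)                       # convert the share to a percentage
    .sort_index()
    .rename("% of Total Projects")
    .to_frame()
)
print(pct_by_category)
```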


In [98]:
#share of COVID-related projects nationwide that fall in each expenditure category group
spending_covid_project = covid_projects_df.groupby("Expenditure Category Group").count()
total_projects_covid = spending_covid_project['Recipient Name'].sum()
categories_percentage_covid = (spending_covid_project['Recipient Name']/total_projects_covid)*100
categories_percentage_covid = pd.DataFrame(categories_percentage_covid)
categories_percentage_covid.rename(columns={'Recipient Name':'% of Covid Projects'}, inplace=True)
categories_percentage_covid

Expenditure Category Group,% of Covid Projects
1-Public Health,37.507599
2-Negative Economic Impacts,30.504559
3-Public Health-Negative Economic Impact: Public Sector Capacity,10.091185
4-Premium Pay,3.075988
5-Infrastructure,1.130699
6-Revenue Replacement,15.039514
7-Administrative,2.650456


In [102]:
#share of each state's projects that fall in each expenditure category group
spending_by_state_all = all_us_projects_df.groupby(['State', 'Expenditure Category Group']).count()
total_projects_per_state_all = spending_by_state_all.groupby('State')['Recipient Name'].sum()
categories_percentage_by_state_all = (spending_by_state_all['Recipient Name']/total_projects_per_state_all)*100
categories_percentage_by_state_all = pd.DataFrame(categories_percentage_by_state_all)
categories_percentage_by_state_all.rename(columns={'Recipient Name': '% of All Projects'}, inplace=True)
categories_percentage_by_state_all

State,Expenditure Category Group,% of All Projects
AK,1-Public Health,4.597701
AK,2-Negative Economic Impacts,67.816092
AK,4-Premium Pay,2.298851
AK,5-Infrastructure,11.494253
AK,6-Revenue Replacement,12.643678
...,...,...
WY,3-Public Health-Negative Economic Impact: Public Sector Capacity,1.574803
WY,4-Premium Pay,0.787402
WY,5-Infrastructure,3.937008
WY,6-Revenue Replacement,35.433071
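
A per-state breakdown like the one above can also be written as a cross-tabulation normalized by row, which returns one row per state and one column per category, with zeros for categories a state did not use. This is a sketch on made-up toy rows, offered as an alternative layout rather than a replacement for the cell above.

```python
import pandas as pd

# Toy project list (made-up rows); column names mirror the EARN workbook.
toy_projects = pd.DataFrame({
    "State": ["AK", "AK", "WY", "WY"],
    "Expenditure Category Group": [
        "1-Public Health", "6-Revenue Replacement",
        "6-Revenue Replacement", "6-Revenue Replacement",
    ],
})

# normalize="index" divides each row by its own total, so every state sums to 100.
pct_by_state = pd.crosstab(
    toy_projects["State"],
    toy_projects["Expenditure Category Group"],
    normalize="index",
) * 100
print(pct_by_state.round(1))
```

The wide layout makes it straightforward to pull out a single column, such as the public health share, and merge it with the vaccination data on the state code.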


In [103]:
#share of each state's COVID-related projects that fall in each expenditure category group
spending_by_state_covid = covid_projects_df.groupby(['State', 'Expenditure Category Group']).count()
total_projects_per_state_covid = spending_by_state_covid.groupby('State')['Recipient Name'].sum()
categories_percentage_by_state_covid = (spending_by_state_covid['Recipient Name']/total_projects_per_state_covid)*100
categories_percentage_by_state_covid = pd.DataFrame(categories_percentage_by_state_covid)
categories_percentage_by_state_covid.rename(columns={'Recipient Name': '% of Covid Projects'}, inplace=True)
categories_percentage_by_state_covid


## Sarah Work Area

## Aaliyah Work Area