<a href="https://www.kaggle.com/code/rafaelcruza/us-vaccine-tracker?scriptVersionId=176918972" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

These can be troubling times, but there is hope on the horizon in the daily news about vaccines. Multiple companies are releasing vaccines and getting them approved, so we may soon see a path forward.

Using the robust toolset provided by Kaggle, I'll show you how to create an interactive map to track, for each state, the percentage of inhabitants that have been vaccinated against COVID-19.  

To get started, if you haven't already, make your own copy of this notebook by clicking on the **[Copy and Edit]** button in the top right corner. 

This notebook is an example of a project that you can create based on what you'd learn from taking Kaggle's [Geospatial Analysis course](https://www.kaggle.com/learn/geospatial-analysis).

# US Vaccine Tracker

We'll use two datasets.  

- The first dataset has the total number of inhabitants of each state, along with latitude and longitude data for each state's capital city.  This dataset is pulled from the 2019 US Census, and I've uploaded it [here](https://www.kaggle.com/peretzcohen/2019-census-us-population-data-by-state).
- The second dataset contains a recent estimate for the total number of people that have been vaccinated in each state.  This [vaccine dataset](https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/us_state_vaccinations.csv) is drawn from [Our World In Data](https://ourworldindata.org/), who update their vaccine datasets from the CDC quite regularly.  Every time you run this notebook, you'll use the most recent version of their data.
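
Because the vaccine file holds one row per location per date, it can help to see how a "most recent value per state" could be pulled out of it. The sketch below uses made-up rows; the `date` column name and the variable names `toy` and `latest` are assumptions for illustration, and this is not the approach taken in the cells that follow.

```python
import pandas as pd

# Illustrative rows only: dates and values are placeholders
toy = pd.DataFrame({
    "date": ["2021-01-13", "2021-01-12", "2021-01-14", "2021-01-12", "2021-01-13"],
    "location": ["Alabama", "Alabama", "Alabama", "Alaska", "Alaska"],
    "people_vaccinated": [74792.0, 70861.0, None, 22000.0, 25000.0],
})

# Sort chronologically, then take the last non-missing value for each location
# (GroupBy.last skips NaN entries)
latest = toy.sort_values("date").groupby("location")["people_vaccinated"].last()

print(latest)
```

Sorting by date first matters because the rows are not guaranteed to arrive in chronological order.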

In the next code cells, we load and preprocess the data.  As output, you'll see a preview of the pandas DataFrame that we'll use to make the tracker, followed by the total percent of the US population that has been vaccinated.

In [1]:
# Imports
import pandas as pd
from datetime import date, timedelta
import folium
from folium import Marker
from folium.plugins import MarkerCluster
import math
import matplotlib.pyplot as plt
import seaborn as sns

# Population Data
populationData = pd.read_csv('/kaggle/input/2019-census-us-population-data-by-state/2019_Census_US_Population_Data_By_State_Lat_Long.csv')

# Vaccination data: one row per location per date, with cumulative counts
vaccinationData = pd.read_csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/us_state_vaccinations.csv')
vaccinationByLocation = vaccinationData[["location", "people_vaccinated"]]

# Vaccination and population data
# Correct the New York State name in the vaccination data using .loc
vaccinationByLocation.loc[vaccinationByLocation['location'] == 'New York State', 'location'] = 'New York'

# Then perform the merge operation
vaccinationAndPopulationByLocation = pd.merge(populationData, vaccinationByLocation, left_on='STATE', right_on='location').drop(columns="location")

# Calculate percentage vaccinated by state
vaccinationAndPopulationByLocation["percent_vaccinated"] = vaccinationAndPopulationByLocation["people_vaccinated"] / vaccinationAndPopulationByLocation["POPESTIMATE2019"]

print(vaccinationAndPopulationByLocation.head(10))
print(vaccinationAndPopulationByLocation.STATE.unique())

     STATE  POPESTIMATE2019        lat       long  people_vaccinated  \
0  Alabama          4903185  32.377716 -86.300568            70861.0   
1  Alabama          4903185  32.377716 -86.300568            74792.0   
2  Alabama          4903185  32.377716 -86.300568            80480.0   
3  Alabama          4903185  32.377716 -86.300568            86956.0   
4  Alabama          4903185  32.377716 -86.300568                NaN   
5  Alabama          4903185  32.377716 -86.300568                NaN   
6  Alabama          4903185  32.377716 -86.300568                NaN   
7  Alabama          4903185  32.377716 -86.300568           114319.0   
8  Alabama          4903185  32.377716 -86.300568           121113.0   
9  Alabama          4903185  32.377716 -86.300568           144429.0   

   percent_vaccinated  
0            0.014452  
1            0.015254  
2            0.016414  
3            0.017735  
4                 NaN  
5                 NaN  
6                 NaN  
7            0.

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)
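
Assigning into a column subset of another DataFrame is what typically triggers pandas' "A value is trying to be set on a copy of a slice" message. A minimal sketch of one common remedy, taking an explicit `.copy()` before assigning, is shown here on made-up data; the names `raw` and `subset` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the vaccination data (values are made up for illustration)
raw = pd.DataFrame({
    "location": ["New York State", "Alabama"],
    "people_vaccinated": [100.0, 50.0],
})

# .copy() makes the subset an independent DataFrame, so the assignment
# below unambiguously modifies `subset` and leaves `raw` untouched
subset = raw[["location", "people_vaccinated"]].copy()
subset.loc[subset["location"] == "New York State", "location"] = "New York"

print(subset["location"].tolist())
print(raw["location"].tolist())
```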


In [2]:
print("Date ran:", date.today())

# Calculate the total percent vaccinated in the US
percentageTotal = (vaccinationAndPopulationByLocation["people_vaccinated"].sum()) / vaccinationAndPopulationByLocation["POPESTIMATE2019"].sum()
print(vaccinationAndPopulationByLocation["people_vaccinated"].sum())
print('Percentage Vaccinated in the US: {}%'.format(round(percentageTotal*100, 2))) 

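# This calculation is wrong: the merged table has one row per state per date, and
# people_vaccinated is cumulative (with some NaN values), so summing every row
# counts each state many times. The corrected calculation comes later, once the
# per-state totals have been computed.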

Date ran: 2024-05-11
92798979584.0
Percentage Vaccinated in the US: 33.3%
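
To see why summing every row misleads, here is a toy illustration with made-up numbers: two hypothetical states, three dates each, and one missing value. It compares the row-wise sum with a calculation that takes one cumulative value and one population figure per state.

```python
import pandas as pd

# Made-up cumulative counts: one row per state per date, with a gap (NaN)
toy = pd.DataFrame({
    "STATE": ["A", "A", "A", "B", "B", "B"],
    "POPESTIMATE2019": [1000, 1000, 1000, 500, 500, 500],
    "people_vaccinated": [100.0, None, 300.0, 50.0, 80.0, 100.0],
})

# Summing every row counts each state's people and population once per date
naive = toy["people_vaccinated"].sum() / toy["POPESTIMATE2019"].sum()

# One value per state: the max of the cumulative column, and the population once
vaccinated = toy.groupby("STATE")["people_vaccinated"].max()
population = toy.groupby("STATE")["POPESTIMATE2019"].first()
per_state = vaccinated.sum() / population.sum()

print(round(naive * 100, 2), round(per_state * 100, 2))
```

The two figures differ even though they come from the same rows, which is the same effect that skews the percentage printed above.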


## Calculating vaccinated percentage by state

In [3]:
# Group by 'STATE' and calculate the max of 'people_vaccinated' for each state as it is a cumulative column
total_vaccinationByState = vaccinationAndPopulationByLocation.groupby('STATE')['people_vaccinated'].max()
print(total_vaccinationByState)

# Get the unique population estimate for each state as it repeats in each row
populationByState = vaccinationAndPopulationByLocation.groupby('STATE')['POPESTIMATE2019'].first()
print(populationByState)

# Calculate the percentage vaccinated for each state and round it to the second decimal place
percent_vaccinatedByState = ((total_vaccinationByState / populationByState) * 100).round(2)
print(percent_vaccinatedByState)

# Note that the vaccination percentage can exceed 100%.
# This could reflect discrepancies between the two datasets (the population
# figures are 2019 estimates) or, for example, people who were vaccinated and
# registered in a state other than their state of residence.

STATE
Alabama                  3193141.0
Alaska                    535718.0
Arizona                  5704677.0
Arkansas                 2115165.0
California              33613401.0
Colorado                 4837792.0
Connecticut              3670090.0
Delaware                  861811.0
District of Columbia      836680.0
Florida                 17810446.0
Georgia                  7287758.0
Hawaii                   1297884.0
Idaho                    1146055.0
Illinois                10036899.0
Indiana                  4350210.0
Iowa                     2235485.0
Kansas                   2229631.0
Kentucky                 3086324.0
Louisiana                2924163.0
Maine                    1315892.0
Maryland                 5566178.0
Massachusetts            7393770.0
Michigan                 6979192.0
Minnesota                4461994.0
Mississippi              1839306.0
Missouri                 4269469.0
Montana                   731323.0
Nebraska                 1425923.0
Nevada        

## Calculating the overall vaccinated percentage across all states

In [4]:
# Calculate the total number of vaccinations and the total population across all states
total_vaccinations = total_vaccinationByState.sum()
total_population = populationByState.sum()

# Calculate the overall percentage vaccinated and round it to the second decimal place
overall_percent_vaccinated = ((total_vaccinations / total_population) * 100).round(2)

print('Overall Percentage Vaccinated in the US: {}%'.format(overall_percent_vaccinated))


Overall Percentage Vaccinated in the US: 80.91%


### States with a vaccination percentage above 100%

In [5]:
# Filter the states with a vaccination percentage above 100%
states_above_100 = percent_vaccinatedByState[percent_vaccinatedByState > 100]

# Print the states
print(states_above_100)

STATE
Connecticut             102.94
District of Columbia    118.55
Massachusetts           107.27
Rhode Island            106.36
dtype: float64
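
If values above 100% are awkward for display, for instance on a color scale, one option is to cap them for plotting while keeping the original numbers for analysis. This is only a sketch of that idea and is not done in this notebook; the labels and the variable names `percent` and `capped` are placeholders.

```python
import pandas as pd

# Placeholder labels; values chosen for illustration
percent = pd.Series({"State A": 64.3, "State B": 118.55, "State C": 102.94})

# Cap at 100 for display only; `percent` itself is left unchanged
capped = percent.clip(upper=100)

print(capped.to_dict())
```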


## Now, creating a choropleth map with vaccination percentage by state

In [6]:
import requests
import json

# Create the map
v_map = folium.Map(location=[42.32,-71.0589], zoom_start=4) 

# Preparing state border map for choropleth map
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
state_geo_url = f'{url}/us-states.json'

# Load GeoJSON file from URL
response = requests.get(state_geo_url)
state_geo = json.loads(response.text)

# Adding state borders
folium.GeoJson(state_geo).add_to(v_map)

# Fixing different index in percent_vaccinatedByState and in the geojson
state_abbreviations = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA', 'Colorado': 'CO',
    'Connecticut': 'CT', 'Delaware': 'DE', 'District of Columbia': 'DC', 'Florida': 'FL', 'Georgia': 'GA',
    'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS',
    'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD', 'Massachusetts': 'MA', 'Michigan': 'MI',
    'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO', 'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV',
    'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC',
    'North Dakota': 'ND', 'Ohio': 'OH', 'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI',
    'South Carolina': 'SC', 'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
    'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY'
}
percent_vaccinatedByState.index = percent_vaccinatedByState.index.map(state_abbreviations)

# Add the percentage data to the GeoJSON with a percent sign
for feature in state_geo['features']:
    state_id = feature['id']
    # Ensure that the state_id exists in the percent_vaccinatedByState Series index before accessing it
    if state_id in percent_vaccinatedByState.index:
        feature['properties']['percentage'] = f"{percent_vaccinatedByState[state_id]}%"

# Add choropleth to the map
choropleth = folium.Choropleth(
    geo_data=state_geo, 
    data=percent_vaccinatedByState,
    key_on="feature.id", 
    fill_color='YlGnBu',
    legend_name='Population percentage vaccinated by US state'
).add_to(v_map)

# Add tooltip to the map
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['name', 'percentage'], labels=False)
)

# Display the map
v_map
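
The join between state names, two-letter ids, and GeoJSON features is easy to get subtly wrong, so a quick check for mismatches can be useful. The following is a minimal sketch in plain Python on hypothetical miniature inputs (the percentages are made up, and `percent_by_name`, `unmapped`, and `no_data` are names introduced here); it is not part of the notebook's pipeline.

```python
# Hypothetical miniature versions of the notebook's inputs
state_abbreviations = {"Alabama": "AL", "Alaska": "AK"}
percent_by_name = {"Alabama": 65.0, "Alaska": 73.0, "Puerto Rico": 90.0}
state_geo = {"features": [
    {"id": "AL", "properties": {"name": "Alabama"}},
    {"id": "AK", "properties": {"name": "Alaska"}},
    {"id": "AZ", "properties": {"name": "Arizona"}},
]}

# Names without an abbreviation cannot be matched to a GeoJSON feature id
unmapped = sorted(n for n in percent_by_name if n not in state_abbreviations)

# Re-key the percentages by abbreviation, skipping unmapped names
percent_by_id = {
    state_abbreviations[name]: value
    for name, value in percent_by_name.items()
    if name in state_abbreviations
}

# Attach a display string to each feature and record features without data
no_data = []
for feature in state_geo["features"]:
    value = percent_by_id.get(feature["id"])
    if value is None:
        no_data.append(feature["id"])
        feature["properties"]["percentage"] = "no data"
    else:
        feature["properties"]["percentage"] = f"{value}%"

print("Unmapped names:", unmapped)
print("Features without data:", no_data)
```

Giving every feature a `percentage` value, even a placeholder string, is a design choice made here so that each feature carries the same set of properties.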

# Your turn

Here are some ideas for how you might improve on the work here:
- In Kaggle's [Geospatial Analysis course](https://www.kaggle.com/learn/geospatial-analysis), you learn how to use folium to create many different types of interactive maps.  How might you use this data to create a different kind of map, for example one with markers placed at each state's capital city?
- In case you would like to work with more data sources,
  - The Centers for Disease Control and Prevention (CDC) in the US releases daily vaccine data and has a vaccination progress tracker on its [COVID Data Tracker site](https://covid.cdc.gov/covid-data-tracker/#vaccinations).
  - NBC News also has a well-made [vaccine tracker](https://www.nbcnews.com/health/health-news/map-covid-19-vaccination-tracker-across-u-s-n1252085).
  
Once you have created your own extension of this work, let us know about it in the comments!