# COVID-19 Analytics and Visualization

## Notebook 1: Loading data from the internet and saving it to your computer

v1.0, December 21, 2020, Copyright &copy; Ken Urquhart

The latest version of this notebook can be found at:

* https://github.com/KenU798/Analytics-and-Data-Visualization-for-COVID-19

## Table of contents

* [0. Loading libraries and setting options](#s0)
* [1. COVID Tracking Project data for the USA (by County) in CSV format](#s1)
* [2. COVID Act Now time series data for the USA (by County) in CSV format](#s2)
* [3. Johns Hopkins time series data for the USA (by County) in CSV format](#s3)
* [4. Johns Hopkins global time series data in CSV format](#s4)
* [5. Mapping and Location Data](#s5)
* [6. Countries of the world boundaries](#s6)
* [7. U.S. State and County geographic boundaries](#s7)
* [8. The data is downloaded and ready for use](#s8)


# COVID-19 in December 2020

* Over 15 million Americans have contracted COVID-19 and more than 300,000 have died.
* Globally there have been more than 76.9 million cases and 1.7 million deaths.

COVID-19 appears to be 5x more deadly than the worst flu season we went through in the last 10 years.

While vaccines are being approved for use, it will be at least 2 months before those who receive initial vaccinations might be immune, and even longer before most of us have access to a vaccine.

And we are now facing a strong resurgence of COVID-19 as we go into winter.

Everywhere you look in the news, you see COVID-19 stories with:

* Graphs and tables showing infection growth, current hospitalizations, ICU beds in use, etc.
* Interactive graphs with pop-ups that show even more information when you hover over lines or points
* Graphs that overlay data and maps of U.S. states and counties
* Maps with time-based sliders that show the progression of cases and deaths

And you can create any and all of these visualizations using published COVID-19 data and free data science tools.

## Learning data science by studying COVID-19 public data

Learning how to build these visualizations from COVID-19 data is a great opportunity to learn how to turn raw data into high-impact analyses and visualizations. Along the way you will learn a lot about using Python, platform-independent data preparation, and interactive presentation tricks.

# 0. Loading libraries and setting options <a class="anchor" id="s0"></a>

Let's get started by importing the Python `libraries` we will need...

In [None]:
# for working with data and doing analytics
import math
import numpy as np
import pandas as pd
import geopandas as gpd

# for working with zip files
from io import BytesIO
from urllib.request import urlopen, urlcleanup
from zipfile import ZipFile

# for working with directories and files
import shutil
from pathlib import Path # OS independent way of dealing with files and paths

# for working with dates and timestamped data
from datetime import datetime

# for visualization and interactive plots and maps
import matplotlib.pyplot as plt
import altair as alt
import plotly.express as px
import folium
from folium import plugins
import branca.colormap as cm

When I'm exploring a new dataset, I like to see all the columns displayed.

Here is a code block that turns on and off display switches for `DataFrames`.

You can read more here: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html

In [None]:
# Uncomment to see all pandas options
#pd.describe_option()

# Display all columns (with restore default truncation commented out)
pd.set_option('display.max_columns', None)
#pd.reset_option('display.max_columns')

# Display all rows (with restore default truncation)
#pd.set_option('display.max_rows', None)
#pd.reset_option('display.max_rows')

# Set maximum column widths when displaying (with reset)
#pd.set_option('display.max_colwidth', None)  # None means no limit (-1 is deprecated in newer pandas)
#pd.reset_option('display.max_colwidth')

# When you don't want to change default options,
# this function displays the passed-in dataframe in full.
def pd_show_full(df):
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', None):
        display(df)

Create a folder in this notebook's working directory to store all the data files I download.

Name the folder with today's date, delete the directory and its contents if it already exists, delete the file if it has the same name as the directory, and then create an empty folder.

In [None]:
download_folder = Path(datetime.today().strftime('%Y-%m-%d') + '-COVID-19-Data')
display(download_folder)

if download_folder.is_dir():
    shutil.rmtree(download_folder)
elif download_folder.is_file():
    download_folder.unlink()

download_folder.mkdir()
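
The folder-reset steps above can also be wrapped in a small reusable function. This is only a minimal sketch and not part of the original notebook: the name `make_fresh_folder` and its parameters are assumptions chosen for illustration. Passing the date in explicitly makes the behavior easy to check.

```python
import shutil
from datetime import date
from pathlib import Path


def make_fresh_folder(parent, day, suffix='-COVID-19-Data'):
    """Create an empty, date-stamped folder inside `parent` and return its path.

    Hypothetical helper: anything already at that path (folder or file)
    is deleted first, just like the cell above.
    """
    folder = Path(parent) / (day.strftime('%Y-%m-%d') + suffix)
    if folder.is_dir():
        shutil.rmtree(folder)   # remove an existing folder and its contents
    elif folder.is_file():
        folder.unlink()         # remove a plain file that has the same name
    folder.mkdir(parents=True)
    return folder


# Example: a folder for December 21, 2020 in the current working directory
# download_folder = make_fresh_folder(Path.cwd(), date(2020, 12, 21))
```

Keep in mind that, like the original cell, this deletes earlier downloads made on the same day.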

# COVID-19 data sources and how to download them

I will use 3 sources of COVID-19 information. Each one provides different levels of detail on the data.

Depending on your Internet connection, it may take a long time to download each dataset.

I save each dataset into the download folder created above using the `pickle` format.

Read more here: https://docs.python.org/3/library/pickle.html

`pickle` serializes the data into a binary format that can be read back quickly into `DataFrames`.

## WARNING: the pickle module is not secure

Malicious pickle data files can be created that execute arbitrary code during unpickling. Never unpickle data from untrusted sources.

This notebook writes and reads its own `pickle` files locally. No pickle files are read from the Internet.
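
To see why the warning matters, here is a deliberately harmless sketch. It is not part of the original notebook, and the class name `NotWhatItSeems` is made up for illustration. A pickle file can name a function to be called when it is loaded. Here that function is the built-in `len`, but a malicious file could name something destructive instead.

```python
import pickle


class NotWhatItSeems:
    """Hypothetical class whose pickle asks the loader to call a function."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call len('abc')".
        # A malicious file could name a harmful function here instead.
        return (len, ('abc',))


payload = pickle.dumps(NotWhatItSeems())

# Loading runs the named call and returns its result, not the original object.
result = pickle.loads(payload)
print(result)  # 3
```

The call happens inside `pickle.loads` itself, before your own code gets a chance to inspect the data.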

# 1. COVID Tracking Project data for the USA (by County) in CSV format <a class="anchor" id="s1"></a>

Detailed COVID-19 data for the United States in one `.csv` file.

See their dashboard at: https://covidtracking.com/data

From their website:

>Every day, our volunteers compile the latest numbers on tests, cases, hospitalizations, and patient outcomes from every US state and territory.

The data can be downloaded directly into a `DataFrame` and saved locally in a `pickle` file for later use:

In [None]:
CTP_all_states = 'https://api.covidtracking.com/v1/states/daily.csv'

df_CTP = pd.read_csv(CTP_all_states)

df_CTP.to_pickle(download_folder / 'covid_tracking_project.pkl')

**Did you get a warning message when you ran the cell?**

If you did, pandas is most likely warning you (typically with a `DtypeWarning`) that some columns contain mixed data types. This tends to happen when non-numeric columns, such as strings and dates, contain missing data: missing cells are filled with `NaN`s by default, so a column can end up holding both text and `NaN` values.

Ignore that for now. I'll deal with it later during analysis.

Here's the `DataFrame` with all columns displayed:

In [None]:
df_CTP

### COVID Tracking Project data dictionary

The data dictionary explaining what each column contains is located in the "Historic values for all states" section of their web page: https://covidtracking.com/data/api

# 2. COVID Act Now time series data for the USA (by County) in CSV format <a class="anchor" id="s2"></a>

Detailed data for all U.S. states in one `.csv` file. See their dashboard at:

https://covidactnow.org

From their website:

>Covid Act Now is a 501(c)(3) nonprofit founded in March 2020. We strive to provide the most timely and accurate local COVID data so that every American can make informed decisions during the pandemic. We are committed to:
>
> **Data**: We support data- and science-backed policies and decision-making
>
> **Transparency**: Our data and methodologies are fully open-source so that the public can vet, freely use, and build upon our work
>
> **Accessibility**: We make data universally accessible so that anyone can easily understand and use it, regardless of ability or prior knowledge

Their API requires a free API key, which you can get by registering here (you need to supply an e-mail address and a little information on what you will use the data for):

https://apidocs.covidactnow.org/access

Once you have your API key, the data can be downloaded using:

In [None]:
CAN_Key = '<your key here>' # replace with your API key
CAN_File = 'states.timeseries.csv'

df_CAN = pd.read_csv(f'https://api.covidactnow.org/v2/{CAN_File}?apiKey={CAN_Key}')
df_CAN.to_pickle(download_folder / 'covid_act_now.pkl')
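
Typing the key straight into a notebook makes it easy to share it by accident. One option, sketched below, is to read it from an environment variable. The variable name `CAN_API_KEY` and the helper `build_can_url` are assumptions for illustration only; the URL pattern is the one used in the cell above.

```python
import os
from urllib.parse import urlencode

CAN_BASE_URL = 'https://api.covidactnow.org/v2/'  # same base URL as the cell above


def build_can_url(file_name, api_key):
    """Return the download URL for one COVID Act Now file (hypothetical helper)."""
    if not api_key or api_key.startswith('<'):
        # Catches an empty key and the '<your key here>' placeholder
        raise ValueError('Set a real COVID Act Now API key first.')
    return CAN_BASE_URL + file_name + '?' + urlencode({'apiKey': api_key})


# CAN_API_KEY is an assumed variable name; set it before starting Jupyter.
# url = build_can_url('states.timeseries.csv', os.environ['CAN_API_KEY'])
# df_CAN = pd.read_csv(url)
```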

In [None]:
df_CAN

# 3. Johns Hopkins time series data for the USA (by County) in CSV format <a class="anchor" id="s3"></a>

Home page at Johns Hopkins University: https://coronavirus.jhu.edu

Raw data home page on GitHub: https://github.com/CSSEGISandData/COVID-19

From their website:

> The Johns Hopkins Coronavirus Resource Center (CRC) is a continuously updated source of COVID-19 data and expert guidance. We aggregate and analyze the best data available on COVID-19—including cases, as well as testing, contact tracing and vaccine efforts—to help the public, policymakers and healthcare professionals worldwide respond to the pandemic.

## Loading JHU COVID-19 Data into a DataFrame

JHU stores its data in `.csv` files, which can be downloaded or read directly via URL from GitHub.

Watch out: Johns Hopkins updates its data access URLs from time to time when it reorganizes its GitHub repository.

Download confirmed U.S. cases and deaths by county, state, and date (international data by country is also available):

In [None]:
JHU_Github_URL = 'https://raw.githubusercontent.com/CSSEGISandData/'
JHU_Github_Path = 'COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'

JHU_File_US_confirmed = 'time_series_covid19_confirmed_US.csv'
JHU_File_US_deaths = 'time_series_covid19_deaths_US.csv'

df_JHU_US_confirmed = pd.read_csv(JHU_Github_URL + JHU_Github_Path + JHU_File_US_confirmed)
df_JHU_US_confirmed.to_pickle(download_folder / 'jhu_confirmed_us.pkl')

df_JHU_US_deaths = pd.read_csv(JHU_Github_URL + JHU_Github_Path + JHU_File_US_deaths)
df_JHU_US_deaths.to_pickle(download_folder / 'jhu_deaths_us.pkl')
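
Because the repository layout can change, a download may fail with an HTTP error. Here is a minimal sketch of a wrapper that reports which URL failed instead of stopping with a long traceback. The name `try_read` is hypothetical, and the sketch assumes the reader raises `urllib.error.HTTPError` or `URLError` for a bad URL, which I believe is what `pd.read_csv` does.

```python
from urllib.error import HTTPError, URLError


def try_read(url, reader):
    """Call reader(url); return its result, or None if the URL cannot be fetched.

    `reader` is any function that takes a URL, for example pd.read_csv.
    """
    try:
        return reader(url)
    except HTTPError as err:   # must come first: HTTPError is a kind of URLError
        print(f'HTTP error {err.code} for {url} (has the file moved?)')
    except URLError as err:
        print(f'Could not reach {url}: {err.reason}')
    return None


# Usage with the names defined in the cell above:
# df = try_read(JHU_Github_URL + JHU_Github_Path + JHU_File_US_confirmed, pd.read_csv)
```

A return value of `None` is your cue to check the repository for the new file location.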

In [None]:
df_JHU_US_confirmed

In [None]:
df_JHU_US_deaths

# 4. Johns Hopkins global time series data in CSV format <a class="anchor" id="s4"></a>

JHU also provides time series data for most countries of the world in a separate `.csv` file.

Download and save international data by country:

In [None]:
JHU_File_Global_confirmed = 'time_series_covid19_confirmed_global.csv'
JHU_File_Global_deaths = 'time_series_covid19_deaths_global.csv'

df_JHU_Global_confirmed = pd.read_csv(JHU_Github_URL + JHU_Github_Path + JHU_File_Global_confirmed)
df_JHU_Global_confirmed.to_pickle(download_folder / 'jhu_confirmed_global.pkl')

df_JHU_Global_deaths = pd.read_csv(JHU_Github_URL + JHU_Github_Path + JHU_File_Global_deaths)
df_JHU_Global_deaths.to_pickle(download_folder / 'jhu_deaths_global.pkl')

In [None]:
df_JHU_Global_confirmed

In [None]:
df_JHU_Global_deaths

# 5. Mapping and Location Data <a class="anchor" id="s5"></a>

## Load and save the JHU Map Dataset

JHU also provides detailed data on populations and County/State/Province/Municipality/etc. description codes and location data.

This data is used to create interactive maps and to normalize case numbers per 100,000 people so you can compare numbers from different locations.
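
Normalizing per 100,000 people is simple arithmetic: multiply the count by 100,000 and divide by the population. Here is a small sketch with made-up numbers (the function name and the figures are for illustration only, not real data):

```python
def per_100k(count, population):
    """Return `count` expressed as a rate per 100,000 people."""
    if population <= 0:
        raise ValueError('population must be positive')
    return count * 100_000 / population


# Made-up example: two places with very different populations
small_county = per_100k(500, 250_000)       # 200.0 cases per 100,000
large_county = per_100k(2_000, 4_000_000)   # 50.0 cases per 100,000
print(small_county, large_county)
```

The smaller county has fewer cases in total but four times the rate, which is exactly what the raw counts hide.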

The detailed explanation of the file contents is at: https://github.com/CSSEGISandData/COVID-19_Unified-Dataset

Don't worry if a lot of this table does not make sense right now. I'll be using it regularly during analysis and visualization, where you will see what each column can be used for.

In [None]:
JHU_Github_Path2 = 'COVID-19_Unified-Dataset/master/'
JHU_geo_data = 'COVID-19_LUT.csv'

df_JHU_geo_data = pd.read_csv(JHU_Github_URL + JHU_Github_Path2 + JHU_geo_data)
df_JHU_geo_data.to_pickle(download_folder / 'jhu_geo_data.pkl')

In [None]:
df_JHU_geo_data

# 6. Countries of the world boundaries <a class="anchor" id="s6"></a>

When making maps, you need the geographic boundaries of all the countries of the world. Fortunately this data is publicly (and freely) available for download.

In [None]:
geodata_url = 'https://www.naturalearthdata.com/http//www.naturalearthdata.com/'
zip_file_path = 'download/10m/cultural/ne_10m_admin_0_countries.zip'

sub_folder = download_folder / 'shapes_global'

urlcleanup()
with urlopen(geodata_url + zip_file_path) as zip_response:
    with ZipFile(BytesIO(zip_response.read())) as zip_file:
        zip_file.extractall(sub_folder)
        
gdf_Global = gpd.read_file(sub_folder / 'ne_10m_admin_0_countries.shp')
gdf_Global.to_pickle(download_folder / 'ne_10m_admin_0_countries.pkl')

shutil.rmtree(sub_folder) # comment out if you want to keep the downloaded folder contents
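
The same download-and-extract steps are used again for the state and county boundaries. Here is a minimal sketch of how they could be factored into functions. The names `extract_zip_bytes` and `download_and_extract` are hypothetical and not part of the original notebook.

```python
from io import BytesIO
from pathlib import Path
from urllib.request import urlopen
from zipfile import ZipFile


def extract_zip_bytes(zip_bytes, dest_folder):
    """Unpack an in-memory zip archive into dest_folder; return the extracted names."""
    dest_folder = Path(dest_folder)
    dest_folder.mkdir(parents=True, exist_ok=True)
    with ZipFile(BytesIO(zip_bytes)) as zip_file:
        zip_file.extractall(dest_folder)
        return zip_file.namelist()


def download_and_extract(url, dest_folder):
    """Download the zip file at url and unpack it into dest_folder."""
    with urlopen(url) as zip_response:
        return extract_zip_bytes(zip_response.read(), dest_folder)


# Usage with the names from the cell above:
# download_and_extract(geodata_url + zip_file_path, sub_folder)
```

Splitting the unpacking from the download means the unpacking step can be tried out on a small local zip file without touching the network.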

In [None]:
# Uncomment to test you can read the pickle file
#gdf_Global = pd.read_pickle(download_folder / 'ne_10m_admin_0_countries.pkl')

gdf_Global.plot()

# 7. U.S. State and County geographic boundaries <a class="anchor" id="s7"></a>

You can get U.S. geospatial data from here for free (you just have to say where you got the data from if you use it):

https://hifld-geoplatform.opendata.arcgis.com

Home page for all available boundary data:

* https://hifld-geoplatform.opendata.arcgis.com/search?groupIds=e5cf7f3805274fef90100ab704ee2ac1

Home page for State-level boundaries data (provided by U.S. Census Bureau):

* https://hifld-geoplatform.opendata.arcgis.com/datasets/us-state-boundaries
* Zip file: https://www2.census.gov/geo/tiger/TIGER2017/STATE/tl_2017_us_state.zip
* Attribution (required): U.S. Census Bureau

Home page for County-level boundaries data (provided by U.S. Census Bureau):

* https://hifld-geoplatform.opendata.arcgis.com/datasets/us-county-boundaries
* Zip file: https://www2.census.gov/geo/tiger/TIGER2017/COUNTY/tl_2017_us_county.zip
* Attribution (required): U.S. Census Bureau

In [None]:
geodata_url = 'https://www2.census.gov/geo/tiger/TIGER2017/'
zip_file_path = 'STATE/tl_2017_us_state.zip'

sub_folder = download_folder / 'shapes_state'

urlcleanup()
with urlopen(geodata_url + zip_file_path) as zip_response:
    with ZipFile(BytesIO(zip_response.read())) as zip_file:
        zip_file.extractall(sub_folder)
        
gdf_US_State = gpd.read_file(sub_folder / 'tl_2017_us_state.shp')
gdf_US_State.to_pickle(download_folder / 'tl_2017_us_state.pkl')

shutil.rmtree(sub_folder) # comment out if you want to keep the downloaded folder contents

In [None]:
# Uncomment to test you can read the pickle file
# gdf_US_State = pd.read_pickle(download_folder / 'tl_2017_us_state.pkl')

gdf_US_State.plot()

In [None]:
zip_file_path = 'COUNTY/tl_2017_us_county.zip'

sub_folder = download_folder / 'shapes_county'

urlcleanup()
with urlopen(geodata_url + zip_file_path) as zip_response:
    with ZipFile(BytesIO(zip_response.read())) as zip_file:
        zip_file.extractall(sub_folder)

gdf_US_County = gpd.read_file(sub_folder / 'tl_2017_us_county.shp')
gdf_US_County.to_pickle(download_folder / 'tl_2017_us_county.pkl')

shutil.rmtree(sub_folder) # comment out if you want to keep the downloaded folder contents

In [None]:
# Uncomment to test you can read the pickle file
#gdf_US_County = pd.read_pickle(download_folder / 'tl_2017_us_county.pkl')

gdf_US_County.plot()

# 8. The data is downloaded and ready for use <a class="anchor" id="s8"></a>

You now have all the data needed to begin tracking, analyzing, and visualizing the COVID-19 pandemic.
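
As a quick check, you can list the `.pkl` files that were written. This is a minimal sketch; the function name `list_pickle_files` is an assumption for illustration.

```python
from pathlib import Path


def list_pickle_files(folder):
    """Return (file name, size in bytes) for every .pkl file in folder, sorted by name."""
    return sorted((p.name, p.stat().st_size) for p in Path(folder).glob('*.pkl'))


# Usage with the folder created at the start of this notebook:
# for name, size in list_pickle_files(download_folder):
#     print(f'{name}: {size:,} bytes')
```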