# Loading the Open COVID-19 Dataset
This short notebook shows how to load the [Open COVID-19 dataset](https://github.com/open-covid-19/data) and includes examples of commonly performed operations.

Loading the data with `pandas` takes a single call. Either the CSV or the JSON file can be used to download the entire Open COVID-19 dataset in one step:

In [8]:
import pandas as pd

# Load CSV data directly from the URL with pandas
data = pd.read_csv('https://open-covid-19.github.io/data/data.csv')

# Alternatively load the JSON data, which should be identical
data_json = pd.read_json('https://open-covid-19.github.io/data/data.json')
assert len(data) == len(data_json)

# Print a small snippet of the dataset
print("Data gathering starts at: {} and ends at: {}".format(data.Date.min(), data.Date.max()))
print('The dataset currently contains {} records of {} countries, here are the last few:'.format(len(data), len(data.CountryCode.unique())))
data.tail()

Data gathering starts at: 2019-12-31 and ends at: 2020-04-13
The dataset currently contains 31346 records of 200 countries, here are the last few:


Unnamed: 0,Date,Key,CountryCode,CountryName,RegionCode,RegionName,Confirmed,Deaths,Latitude,Longitude,Population
31341,2020-04-13,JP_06,JP,Japan,6,Yamagata,39.0,,38.433,140.133,1079950.0
31342,2020-04-13,JP_18,JP,Japan,18,Fukui,92.0,,35.983,136.183,778943.0
31343,2020-04-13,JP_28,JP,Japan,28,Hyōgo,379.0,,34.690817,135.183078,5469762.0
31344,2020-04-13,JP_32,JP,Japan,32,Shimane,8.0,,35.217,132.667,689963.0
31345,2020-04-13,JP_40,JP,Japan,40,Fukuoka,365.0,,33.6,130.583,5109323.0
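Without extra arguments, `pd.read_csv` leaves the `Date` column as plain strings. The following is a minimal sketch of converting it to datetime values, using a small in-memory frame (named `sample` here for illustration) built from the Madrid rows shown later in this notebook instead of the downloaded data:

```python
import pandas as pd

# Small stand-in for the real dataset so the example runs offline;
# the values are copied from the Madrid (ES_MD) records in this notebook
sample = pd.DataFrame({
    'Date': ['2020-03-29', '2020-03-30', '2020-03-31'],
    'Key': ['ES_MD', 'ES_MD', 'ES_MD'],
    'Confirmed': [22677.0, 24090.0, 27509.0],
})

# Convert the ISO-formatted strings into datetime values
sample['Date'] = pd.to_datetime(sample['Date'])

# Date arithmetic now works: number of days covered by the sample
span_days = (sample['Date'].max() - sample['Date'].min()).days

# Filter by date using a Timestamp instead of string comparison
recent = sample[sample['Date'] >= pd.Timestamp('2020-03-30')]

# Day-over-day change in the cumulative confirmed count
new_cases = sample['Confirmed'].diff()
```

The same conversion should apply unchanged to the full `data` frame loaded above.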


### Looking at country-level data
Some records contain country-level data, that is, data aggregated for the whole country. Other records contain region-level data, which describe subdivisions of a country such as Chinese provinces or US states.

To filter only country-level data from the dataset, look for records that have a null value for the region:

In [2]:
# Look for rows with null RegionCode
countries = data[data['RegionCode'].isna()]

# We no longer need the region-level columns
countries = countries.drop(columns=['RegionCode', 'RegionName'])

countries.tail()

Unnamed: 0,Date,Key,Confirmed,Deaths,CountryCode,CountryName,Latitude,Longitude,Population
13224,2020-03-29,PR,100.0,3.0,PR,Puerto Rico,18.220833,-66.590149,2933408.0
13225,2020-03-30,PR,127.0,5.0,PR,Puerto Rico,18.220833,-66.590149,2933408.0
13226,2020-03-31,PR,174.0,6.0,PR,Puerto Rico,18.220833,-66.590149,2933408.0
13227,2020-03-31,FK,0.0,0.0,FK,Falkland Islands,-51.796253,-59.523613,3377.0
13228,2020-03-31,MP,2.0,0.0,MP,Northern Mariana Islands,17.33083,145.38469,56188.0
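Because country-level records carry a `Population` column, counts can be normalized by population. The sketch below uses a small in-memory frame with the three 2020-03-31 rows from the output above; the column name `ConfirmedPer100k` is an illustrative choice, not part of the dataset:

```python
import pandas as pd

# Rows copied from the country-level output above (2020-03-31)
countries_sample = pd.DataFrame({
    'Key': ['PR', 'FK', 'MP'],
    'Confirmed': [174.0, 0.0, 2.0],
    'Population': [2933408.0, 3377.0, 56188.0],
}).set_index('Key')

# Confirmed cases per 100,000 inhabitants (hypothetical derived column)
countries_sample['ConfirmedPer100k'] = (
    countries_sample['Confirmed'] / countries_sample['Population'] * 100000
)
```

Rows with a missing `Population` would produce `NaN` here, so they may need to be dropped or filled in before comparing countries.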


### Looking at region-level data
Conversely, to filter region-level data for a specific country, we need to look for records where the region columns have non-null values. The following snippet extracts data related to Spain's subregions from the dataset:

In [3]:
# Filter records that have the right country code AND a non-null region code
spain_regions = data[(data['CountryCode'] == 'ES') & ~(data['RegionCode'].isna())]

spain_regions.tail()

Unnamed: 0,Date,Key,Confirmed,Deaths,CountryCode,CountryName,RegionCode,RegionName,Latitude,Longitude,Population
9404,2020-03-27,ES_VC,3532.0,198.0,ES,Spain,VC,Comunidad Valenciana,39.4697,-0.3774,
9405,2020-03-28,ES_VC,4034.0,234.0,ES,Spain,VC,Comunidad Valenciana,39.4697,-0.3774,
9406,2020-03-29,ES_VC,4784.0,267.0,ES,Spain,VC,Comunidad Valenciana,39.4697,-0.3774,
9407,2020-03-30,ES_VC,5110.0,310.0,ES,Spain,VC,Comunidad Valenciana,39.4697,-0.3774,
9408,2020-03-31,ES_VC,5508.0,339.0,ES,Spain,VC,Comunidad Valenciana,39.4697,-0.3774,
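A common follow-up is to keep only the most recent record for each region. This is one possible approach, sketched on a small in-memory frame built from the Valencia and Madrid rows in this notebook; it assumes dates are ISO-formatted so that sorting the strings sorts them chronologically:

```python
import pandas as pd

# Rows copied from the ES_VC and ES_MD records shown in this notebook
regions_sample = pd.DataFrame({
    'Date': ['2020-03-30', '2020-03-31', '2020-03-30', '2020-03-31'],
    'Key': ['ES_VC', 'ES_VC', 'ES_MD', 'ES_MD'],
    'Confirmed': [5110.0, 5508.0, 24090.0, 27509.0],
    'Deaths': [310.0, 339.0, 3392.0, 3603.0],
})

# Sort chronologically, then keep the last row seen for each key
latest = (
    regions_sample
    .sort_values('Date')
    .groupby('Key')
    .tail(1)
    .set_index('Key')
)
```

Since regions may report on different days, the latest rows are not guaranteed to share the same date.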


### Using the `Key` column
The `Key` column is present in all datasets and is unique for each country-region combination, so a specific country or region can be retrieved with a single filter. For country-level data the key is simply `CountryCode`; for region-level data it is `${CountryCode}_${RegionCode}`:

In [4]:
# Filter records for Spain at the country-level
spain_country = data[data['Key'] == 'ES']

# We no longer need the region-level columns
spain_country = spain_country.drop(columns=['RegionCode', 'RegionName'])

spain_country.tail()

Unnamed: 0,Date,Key,Confirmed,Deaths,CountryCode,CountryName,Latitude,Longitude,Population
5731,2020-03-28,ES,64059.0,4858.0,ES,Spain,40.463667,-3.74922,46736776.0
5732,2020-03-29,ES,72248.0,5690.0,ES,Spain,40.463667,-3.74922,46736776.0
5733,2020-03-30,ES,78797.0,6528.0,ES,Spain,40.463667,-3.74922,46736776.0
5734,2020-03-31,ES,85195.0,7340.0,ES,Spain,40.463667,-3.74922,46736776.0
5735,2020-04-01,ES,94417.0,8189.0,ES,Spain,40.463667,-3.74922,46736776.0


In [5]:
# Filter records for Madrid, one of the subregions of Spain
madrid = data[data['Key'] == 'ES_MD']

madrid.tail()

Unnamed: 0,Date,Key,Confirmed,Deaths,CountryCode,CountryName,RegionCode,RegionName,Latitude,Longitude,Population
9259,2020-03-27,ES_MD,19243.0,2412.0,ES,Spain,MD,Madrid,40.4165,-3.70256,
9260,2020-03-28,ES_MD,21520.0,2757.0,ES,Spain,MD,Madrid,40.4165,-3.70256,
9261,2020-03-29,ES_MD,22677.0,3082.0,ES,Spain,MD,Madrid,40.4165,-3.70256,
9262,2020-03-30,ES_MD,24090.0,3392.0,ES,Spain,MD,Madrid,40.4165,-3.70256,
9263,2020-03-31,ES_MD,27509.0,3603.0,ES,Spain,MD,Madrid,40.4165,-3.70256,
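The key format described above can be sketched as a small function. `build_key` is a hypothetical helper written for illustration; it is not part of the dataset or of `pandas`:

```python
import pandas as pd

def build_key(country_code, region_code=None):
    """Hypothetical helper: build a `Key` value from its two parts."""
    # Country-level records have no region code (None or NaN)
    if region_code is None or pd.isna(region_code):
        return country_code
    # Region-level records join both codes with an underscore
    return '{}_{}'.format(country_code, region_code)

spain_key = build_key('ES')
madrid_key = build_key('ES', 'MD')
```

Note that the first output in this notebook shows `JP_06` as the key for a `RegionCode` displayed as `6`, so numeric-looking region codes appear zero-padded in `Key`. This sketch therefore expects region codes as strings and does no padding of its own.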


### Minimal dataset
If the `Key`, `Confirmed` and `Deaths` columns are sufficient for your application, you can get the latest data from the `data_minimal` dataset, which contains only the columns essential for data analysis. Here is how you would get the same data for Madrid:

In [6]:
# Load the minimal dataset
minimal = pd.read_csv('https://open-covid-19.github.io/data/data_minimal.csv')

# Filter records for Madrid, one of the subregions of Spain
madrid = minimal[minimal['Key'] == 'ES_MD']

madrid.tail()

Unnamed: 0,Date,Key,Confirmed,Deaths,CountryCode,CountryName,RegionCode,RegionName,Latitude,Longitude,Population
11720,2020-03-27,ES_MD,19243.0,2412.0,ES,Spain,MD,Madrid,40.4165,-3.70256,
12057,2020-03-28,ES_MD,21520.0,2757.0,ES,Spain,MD,Madrid,40.4165,-3.70256,
12399,2020-03-29,ES_MD,22677.0,3082.0,ES,Spain,MD,Madrid,40.4165,-3.70256,
12740,2020-03-30,ES_MD,24090.0,3392.0,ES,Spain,MD,Madrid,40.4165,-3.70256,
13060,2020-03-31,ES_MD,27509.0,3603.0,ES,Spain,MD,Madrid,40.4165,-3.70256,


### Data consistency
Region-level and country-level data often come from different sources. As a result, the numbers may not add up exactly, and dates may be misaligned (the data for a region may be reported sooner or later than for the whole country). However, country-level and region-level data will *always* be self-consistent.
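One way to inspect such discrepancies is to compare the country-level total with the sum of its regions for each date. The sketch below uses made-up numbers for a hypothetical country `XX` with regions `A` and `B`; none of the values come from the real dataset:

```python
import pandas as pd

# Made-up records for a hypothetical country 'XX' and its two regions
toy = pd.DataFrame({
    'Date': ['2020-03-30', '2020-03-30', '2020-03-30',
             '2020-03-31', '2020-03-31', '2020-03-31'],
    'Key': ['XX', 'XX_A', 'XX_B', 'XX', 'XX_A', 'XX_B'],
    'CountryCode': ['XX'] * 6,
    'RegionCode': [None, 'A', 'B', None, 'A', 'B'],
    'Confirmed': [100.0, 60.0, 38.0, 130.0, 80.0, 50.0],
})

# Country-level rows are the ones without a region code
is_country = toy['RegionCode'].isna()

# Country-level totals, indexed by date
country_total = toy[is_country].set_index('Date')['Confirmed']

# Sum of the region-level counts for each date
region_sum = toy[~is_country].groupby('Date')['Confirmed'].sum()

# Positive values mean the country total exceeds the regional sum
gap = (country_total - region_sum).rename('Gap')
```

In this toy example the totals differ by 2 on the first day and match on the second. Dates present in only one of the two series would show up as `NaN` in `gap`, which is one way to spot date misalignment.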