# Basic data loading
This notebook goes over how to load data using the `covid19pandas` package. We will cover the many loading options available.

First, import the package.

In [1]:
import covid19pandas as cod

## Data source options
Our package provides access to data from both [Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) and [The New York Times](https://github.com/nytimes/covid-19-data). We provide both data sources for comparison, as they aggregate data from different sources.

Here are some details about both datasets:
### Johns Hopkins
- Function for access: `get_data_jhu`
- Provides both global and US data
- US data is broken down into states and counties
- For global data, provides confirmed cases, deaths, and recovered counts
- For US data, provides only counts of confirmed cases and deaths

Note: Johns Hopkins [has stated](https://github.com/CSSEGISandData/COVID-19/issues/1250#issuecomment-606354840) that the reason they do not provide recovered counts for their US-specific data is that they cannot find a reliable source for recovered counts at the county level.

You can easily get JHU's global data with just a single function call, `get_data_jhu()`. By default, the table is formatted so that each row represents a single day (see the "date" column) for a given locale, including total cases, total deaths, and total recovered individuals.

In [2]:
df = cod.get_data_jhu()
df.head(5)

These data were obtained from Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19).


Unnamed: 0,date,UID,iso2,iso3,code3,Province/State,Country/Region,Lat,Long,Combined_Key,Population,cases,deaths,recovered
0,2020-01-22,4,AF,AFG,4.0,,Afghanistan,33.93911,67.709953,Afghanistan,38928341.0,0,0,0
1,2020-01-22,8,AL,ALB,8.0,,Albania,41.1533,20.1683,Albania,2877800.0,0,0,0
2,2020-01-22,12,DZ,DZA,12.0,,Algeria,28.0339,1.6596,Algeria,43851043.0,0,0,0
3,2020-01-22,20,AD,AND,20.0,,Andorra,42.5063,1.5218,Andorra,77265.0,0,0,0
4,2020-01-22,24,AO,AGO,24.0,,Angola,-11.2027,17.8739,Angola,32866268.0,0,0,0


In [3]:
df.tail(5)

Unnamed: 0,date,UID,iso2,iso3,code3,Province/State,Country/Region,Lat,Long,Combined_Key,Population,cases,deaths,recovered
31768,2020-05-19,275,PS,PSE,275.0,,West Bank and Gaza,31.9522,35.2332,WestBankandGaza,5101416.0,391,2,346
31769,2020-05-19,732,EH,ESH,732.0,,Western Sahara,24.2155,-12.8858,WesternSahara,597330.0,6,0,6
31770,2020-05-19,887,YE,YEM,887.0,,Yemen,15.552727,48.516388,Yemen,29825968.0,167,28,5
31771,2020-05-19,894,ZM,ZMB,894.0,,Zambia,-13.133897,27.849332,Zambia,18383956.0,772,7,192
31772,2020-05-19,716,ZW,ZWE,716.0,,Zimbabwe,-19.015438,29.154857,Zimbabwe,14862927.0,46,4,18
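
The counts in this table appear to be running totals, so a common next step is to derive day-over-day changes. The following is a minimal sketch of that idea in pandas, assuming the counts are cumulative per locale. It uses a small hand-made table (the numbers are invented for illustration, not real data) so that it runs without a download; `df_toy` and `new_cases` are names chosen here, not part of the package.

```python
import pandas as pd

# Toy long-format table shaped like the JHU output above.
# The counts are made up for illustration; they are not real data.
df_toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "Combined_Key": ["A"] * 3 + ["B"] * 3,
    "cases": [1, 4, 9, 0, 2, 2],
})

# Sort so that each locale's rows are in chronological order.
df_toy = df_toy.sort_values(["Combined_Key", "date"])

# diff() within each locale turns running totals into daily changes.
# The first day of each locale has no previous day, so we fall back
# to the running total itself for that row.
df_toy["new_cases"] = (
    df_toy.groupby("Combined_Key")["cases"]
    .diff()
    .fillna(df_toy["cases"])
    .astype(int)
)

print(df_toy)
```

The same pattern should apply to the "deaths" and "recovered" columns, provided they are also cumulative.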


### The New York Times
- Function for access: `get_data_nyt`
- Provides only US data
- Has both state and county level data
- Provides counts of confirmed cases and deaths

You can also get NYT data for the US with a single function call, `get_data_nyt()`. By default, it comes in the same format as the JHU data: each row is a single day for a given locale. (The "fips" column is a location identifier.)

In [4]:
df = cod.get_data_nyt()
df.head(5)

These data were obtained from The New York Times (https://github.com/nytimes/covid-19-data).


Unnamed: 0,date,state,fips,cases,deaths
0,2020-01-21,Washington,53,1,0
1,2020-01-22,Washington,53,1,0
2,2020-01-23,Washington,53,1,0
3,2020-01-24,Illinois,17,1,0
4,2020-01-24,Washington,53,1,0


In [5]:
df.tail(5)

Unnamed: 0,date,state,fips,cases,deaths
4299,2020-05-19,Virginia,51,32145,1041
4300,2020-05-19,Washington,53,20063,1042
4301,2020-05-19,West Virginia,54,1514,68
4302,2020-05-19,Wisconsin,55,13001,467
4303,2020-05-19,Wyoming,56,776,10
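
Because every row is one locale on one day, the most recent snapshot is simply the set of rows with the latest date. Here is a small sketch of that filter, using an invented table with the same column names as the state-level output above. The values are placeholders, and `states` and `latest` are names chosen for this example.

```python
import pandas as pd

# Toy table with the same columns as the state-level output above.
# Values are invented for illustration; they are not real data.
states = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-05-18", "2020-05-18", "2020-05-19", "2020-05-19"]
    ),
    "state": ["S1", "S2", "S1", "S2"],
    "cases": [10, 20, 12, 25],
    "deaths": [1, 2, 1, 3],
})

# Keep only the rows for the most recent date, indexed by state.
latest = states[states["date"] == states["date"].max()].set_index("state")

print(latest)
```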


## Data type options
By default, our package will return all data types, meaning counts of both confirmed cases and deaths, and counts of recovered patients if available. You may want a table that only has one data type. To get a table with just one data type, pass that data type as a string to the `data_type` parameter. Pass `"cases"` to get counts of confirmed cases, `"deaths"` to get counts of deaths, or `"recovered"` to get counts of recoveries.

For example, to get just the counts of confirmed cases:

In [6]:
df = cod.get_data_jhu(data_type="cases")
df.head(10)

These data were obtained from Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19).


Unnamed: 0,date,UID,iso2,iso3,code3,Province/State,Country/Region,Lat,Long,Combined_Key,Population,cases
0,2020-01-22,4,AF,AFG,4.0,,Afghanistan,33.93911,67.709953,Afghanistan,38928341.0,0
1,2020-01-22,8,AL,ALB,8.0,,Albania,41.1533,20.1683,Albania,2877800.0,0
2,2020-01-22,12,DZ,DZA,12.0,,Algeria,28.0339,1.6596,Algeria,43851043.0,0
3,2020-01-22,20,AD,AND,20.0,,Andorra,42.5063,1.5218,Andorra,77265.0,0
4,2020-01-22,24,AO,AGO,24.0,,Angola,-11.2027,17.8739,Angola,32866268.0,0
5,2020-01-22,28,AG,ATG,28.0,,Antigua and Barbuda,17.0608,-61.7964,AntiguaandBarbuda,97928.0,0
6,2020-01-22,32,AR,ARG,32.0,,Argentina,-38.4161,-63.6167,Argentina,45195777.0,0
7,2020-01-22,51,AM,ARM,51.0,,Armenia,40.0691,45.0382,Armenia,2963234.0,0
8,2020-01-22,3601,AU,AUS,36.0,Australian Capital Territory,Australia,-35.4735,149.0124,"AustralianCapitalTerritory,Australia",428100.0,0
9,2020-01-22,3602,AU,AUS,36.0,New South Wales,Australia,-33.8688,151.2093,"NewSouthWales,Australia",8118000.0,0


If you ask for an unavailable data type (e.g. passing `"recovered"` to the `get_data_nyt` function), the package will notify you and raise an exception. Passing `"all"` (the default argument) returns all available data types.
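
If you already have an all-types table in memory, you can get the same shape as a single-type table by dropping the other count columns. The sketch below shows one way to do this in plain pandas. It uses a hand-made table with invented values, and the names `all_types`, `count_cols`, and `cases_only` are assumptions for this example rather than anything the package defines.

```python
import pandas as pd

# Toy stand-in for an all-types table; values are made up.
all_types = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02"]),
    "Country/Region": ["X", "X"],
    "cases": [10, 15],
    "deaths": [0, 1],
    "recovered": [2, 3],
})

count_cols = ["cases", "deaths", "recovered"]
wanted = "cases"

# Drop every count column except the one we want to keep.
cases_only = all_types.drop(columns=[c for c in count_cols if c != wanted])

print(cases_only.columns.tolist())
```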

## Dataframe format options
Our package provides the option of returning data tables in either the wide or long format. In a wide format table, each date has its own column and each row is a single locale. In a long format table, the dates are stacked into a single "date" column, so each row is one locale on one day. (See [this explanation](https://en.wikipedia.org/wiki/Wide_and_narrow_data) from Wikipedia for a more in-depth discussion of the differences.)

Many plotting tools prefer data in the long format. However, the wide format can also be useful, and sometimes makes data easier to think about. So, we provide both options. The actual data is the same either way; we just convert between the different formats for you.

Our package defaults to returning tables in the long format. To get a table in the wide format instead, pass `"wide"` to the `format` parameter in the getter function. With a wide format table, you can only have one data type (cases, deaths, or recovered) in the table. So, you must pass an argument other than the default `"all"` to the `data_type` parameter. If you want multiple data types in the wide format, you need a separate table for each data type.

In [7]:
df = cod.get_data_jhu(format="wide", data_type="cases")
df.head(10)

These data were obtained from Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19).


Unnamed: 0,UID,iso2,iso3,code3,Province/State,Country/Region,Lat,Long,Combined_Key,Population,...,2020-05-10 00:00:00,2020-05-11 00:00:00,2020-05-12 00:00:00,2020-05-13 00:00:00,2020-05-14 00:00:00,2020-05-15 00:00:00,2020-05-16 00:00:00,2020-05-17 00:00:00,2020-05-18 00:00:00,2020-05-19 00:00:00
0,4,AF,AFG,4.0,,Afghanistan,33.93911,67.709953,Afghanistan,38928341.0,...,4402,4687,4963,5226,5639,6053,6402,6664,7072,7653
1,8,AL,ALB,8.0,,Albania,41.1533,20.1683,Albania,2877800.0,...,868,872,876,880,898,916,933,946,948,949
2,12,DZ,DZA,12.0,,Algeria,28.0339,1.6596,Algeria,43851043.0,...,5723,5891,6067,6253,6442,6629,6821,7019,7201,7377
3,20,AD,AND,20.0,,Andorra,42.5063,1.5218,Andorra,77265.0,...,755,755,758,760,761,761,761,761,761,761
4,24,AO,AGO,24.0,,Angola,-11.2027,17.8739,Angola,32866268.0,...,45,45,45,45,48,48,48,48,50,52
5,28,AG,ATG,28.0,,Antigua and Barbuda,17.0608,-61.7964,AntiguaandBarbuda,97928.0,...,25,25,25,25,25,25,25,25,25,25
6,32,AR,ARG,32.0,,Argentina,-38.4161,-63.6167,Argentina,45195777.0,...,6034,6278,6563,6879,7134,7479,7805,8068,8371,8809
7,51,AM,ARM,51.0,,Armenia,40.0691,45.0382,Armenia,2963234.0,...,3313,3392,3538,3718,3860,4044,4283,4472,4823,5041
8,3601,AU,AUS,36.0,Australian Capital Territory,Australia,-35.4735,149.0124,"AustralianCapitalTerritory,Australia",428100.0,...,107,107,107,107,107,107,107,107,107,107
9,3602,AU,AUS,36.0,New South Wales,Australia,-33.8688,151.2093,"NewSouthWales,Australia",8118000.0,...,3053,3053,3059,3063,3071,3074,3075,3076,3078,3081


Note how each date (each variable) now has its own column.
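
To make the relationship between the two formats concrete, here is a small sketch of the conversion in plain pandas, using `pivot` to go from long to wide and `melt` to go back. This is not necessarily how the package converts tables internally; it only illustrates that both layouts hold the same data. The table is hand-made with invented values, and `long_df`, `wide_df`, and `back_to_long` are names chosen for this example.

```python
import pandas as pd

# Toy long-format table: one row per locale per day (made-up values).
long_df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02"]
    ),
    "Combined_Key": ["A", "A", "B", "B"],
    "cases": [1, 4, 0, 2],
})

# Long -> wide: one row per locale, one column per date.
wide_df = long_df.pivot(index="Combined_Key", columns="date", values="cases")

# Wide -> long: fold the date columns back into a single "date" column.
back_to_long = (
    wide_df.reset_index()
    .melt(id_vars="Combined_Key", var_name="date", value_name="cases")
    .sort_values(["Combined_Key", "date"])
    .reset_index(drop=True)
)

print(wide_df)
print(back_to_long)
```

Note that `pivot` can only spread one value column across the dates at a time, which mirrors the one-data-type restriction on wide tables described above.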

## JHU Data: Region options
Johns Hopkins provides both global and US data. The global data is organized by country, but broken down into states or provinces where possible. The US data is organized by county. Global data includes recovered counts, as well as cases and deaths, but the US data only includes cases and deaths. Both tables include latitude and longitude for mapping each sample, and the US table includes additional geographical identifiers.

The `get_data_jhu` function defaults to returning global data. To get US data, pass `"us"` to the `region` parameter.

In [8]:
df = cod.get_data_jhu(region="us")
df.head(10)

These data were obtained from Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19).


Unnamed: 0,date,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long,Combined_Key,Population,cases,deaths
0,2020-01-22,84001001,US,USA,840.0,1001.0,Autauga,Alabama,US,32.539527,-86.644082,"Autauga,Alabama,US",55869.0,0,0
1,2020-01-22,84001003,US,USA,840.0,1003.0,Baldwin,Alabama,US,30.72775,-87.722071,"Baldwin,Alabama,US",223234.0,0,0
2,2020-01-22,84001005,US,USA,840.0,1005.0,Barbour,Alabama,US,31.868263,-85.387129,"Barbour,Alabama,US",24686.0,0,0
3,2020-01-22,84001007,US,USA,840.0,1007.0,Bibb,Alabama,US,32.996421,-87.125115,"Bibb,Alabama,US",22394.0,0,0
4,2020-01-22,84001009,US,USA,840.0,1009.0,Blount,Alabama,US,33.982109,-86.567906,"Blount,Alabama,US",57826.0,0,0
5,2020-01-22,84001011,US,USA,840.0,1011.0,Bullock,Alabama,US,32.100305,-85.712655,"Bullock,Alabama,US",10101.0,0,0
6,2020-01-22,84001013,US,USA,840.0,1013.0,Butler,Alabama,US,31.753001,-86.680575,"Butler,Alabama,US",19448.0,0,0
7,2020-01-22,84001015,US,USA,840.0,1015.0,Calhoun,Alabama,US,33.774837,-85.826304,"Calhoun,Alabama,US",113605.0,0,0
8,2020-01-22,84001017,US,USA,840.0,1017.0,Chambers,Alabama,US,32.913601,-85.390727,"Chambers,Alabama,US",33254.0,0,0
9,2020-01-22,84001019,US,USA,840.0,1019.0,Cherokee,Alabama,US,34.17806,-85.60639,"Cherokee,Alabama,US",26196.0,0,0
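
Since the US table has one row per county per day, state-level totals can be recovered by summing over counties. The following is a minimal sketch with a hand-made table that borrows the column names shown above ("Admin2" for the county, "Province_State" for the state). The values are invented, and `county_df` and `state_df` are names chosen for this example.

```python
import pandas as pd

# Toy county-level table using the column names from the output above.
# Values are made up for illustration; they are not real data.
county_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-01"] * 3 + ["2020-04-02"] * 3),
    "Admin2": ["c1", "c2", "c3"] * 2,
    "Province_State": ["S1", "S1", "S2"] * 2,
    "cases": [5, 3, 7, 6, 4, 9],
    "deaths": [0, 1, 0, 0, 1, 1],
})

# Sum the county counts within each state for each day.
state_df = county_df.groupby(
    ["date", "Province_State"], as_index=False
)[["cases", "deaths"]].sum()

print(state_df)
```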


## NYT Data: County option
The New York Times provides both state and county level data for the US. Both types include only counts of confirmed cases and deaths. Each row in the tables includes a [FIPS code](https://www.census.gov/quickfacts/fact/note/US/fips) for geographical identification.

The `get_data_nyt` function defaults to state level data. To get county level data, pass `True` to the `counties` parameter.

In [9]:
df = cod.get_data_nyt(counties=True)
df.head(10)

These data were obtained from The New York Times (https://github.com/nytimes/covid-19-data).


Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0
1,2020-01-22,Snohomish,Washington,53061.0,1,0
2,2020-01-23,Snohomish,Washington,53061.0,1,0
3,2020-01-24,Cook,Illinois,17031.0,1,0
4,2020-01-24,Snohomish,Washington,53061.0,1,0
5,2020-01-25,Orange,California,6059.0,1,0
6,2020-01-25,Cook,Illinois,17031.0,1,0
7,2020-01-25,Snohomish,Washington,53061.0,1,0
8,2020-01-26,Maricopa,Arizona,4013.0,1,0
9,2020-01-26,Los Angeles,California,6037.0,1,0
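
In the output above, the "fips" column is displayed as a float (e.g. `6059.0`), whereas county FIPS codes are conventionally written as five-digit strings with leading zeros (e.g. `06059`). The float dtype also suggests that some rows may have no code at all. If you need to join against other datasets keyed by FIPS, a conversion along these lines may help. This is a sketch under those assumptions; `format_fips` is a hypothetical helper, not part of the package.

```python
import pandas as pd

# Example values as they appear in the output above, plus a missing one.
fips = pd.Series([53061.0, 6059.0, float("nan")])


def format_fips(value):
    """Return a zero-padded five-digit county FIPS string, or None if missing."""
    if pd.isna(value):
        return None
    return f"{int(value):05d}"


fips_str = fips.map(format_fips)

print(fips_str.tolist())
```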


## Update options
By default, every time you load a table, the package will send a web request to the original source of the table and re-download it, to make sure you have the most up-to-date version of the table. This normally does not take very long. Each time you load a table, the package saves a copy of it on your local machine, so that you can still load the table later even if you are somewhere without an internet connection.

However, you may want to skip the update to avoid small changes in the data you're working with. Skipping it could also speed things up if you have a slow internet connection. To skip the data update step, pass `False` to the `update` parameter when you load the table. The package will warn you that the table has not been updated.

In [10]:
df = cod.get_data_jhu(update=False)
df.head(10)



These data were obtained from Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19).




Unnamed: 0,date,UID,iso2,iso3,code3,Province/State,Country/Region,Lat,Long,Combined_Key,Population,cases,deaths,recovered
0,2020-01-22,4,AF,AFG,4.0,,Afghanistan,33.93911,67.709953,Afghanistan,38928341.0,0,0,0
1,2020-01-22,8,AL,ALB,8.0,,Albania,41.1533,20.1683,Albania,2877800.0,0,0,0
2,2020-01-22,12,DZ,DZA,12.0,,Algeria,28.0339,1.6596,Algeria,43851043.0,0,0,0
3,2020-01-22,20,AD,AND,20.0,,Andorra,42.5063,1.5218,Andorra,77265.0,0,0,0
4,2020-01-22,24,AO,AGO,24.0,,Angola,-11.2027,17.8739,Angola,32866268.0,0,0,0
5,2020-01-22,28,AG,ATG,28.0,,Antigua and Barbuda,17.0608,-61.7964,AntiguaandBarbuda,97928.0,0,0,0
6,2020-01-22,32,AR,ARG,32.0,,Argentina,-38.4161,-63.6167,Argentina,45195777.0,0,0,0
7,2020-01-22,51,AM,ARM,51.0,,Armenia,40.0691,45.0382,Armenia,2963234.0,0,0,0
8,2020-01-22,3601,AU,AUS,36.0,Australian Capital Territory,Australia,-35.4735,149.0124,"AustralianCapitalTerritory,Australia",428100.0,0,0,0
9,2020-01-22,3602,AU,AUS,36.0,New South Wales,Australia,-33.8688,151.2093,"NewSouthWales,Australia",8118000.0,0,0,0
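
For readers curious about the general pattern behind an `update` flag, here is a rough sketch of "refresh the local copy, or fall back to it" logic. This is not the package's actual implementation. `load_table` and `fake_fetch` are hypothetical names, and the web request is replaced by a stand-in function so the example runs offline.

```python
import tempfile
from pathlib import Path


def load_table(cache_path, fetch, update=True):
    """Return CSV text, refreshing the local copy first when update is True."""
    cache_path = Path(cache_path)
    if update:
        # Re-download and overwrite the local copy.
        cache_path.write_text(fetch())
    elif not cache_path.exists():
        raise FileNotFoundError(
            f"No cached copy at {cache_path}; load once with update=True."
        )
    return cache_path.read_text()


calls = []


def fake_fetch():
    """Stand-in for a web request; records each call so we can count them."""
    calls.append(1)
    return "date,cases\n2020-01-22,0\n"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "table.csv"
    first = load_table(path, fake_fetch)                 # fetches and caches
    second = load_table(path, fake_fetch, update=False)  # reads the cache only

print(len(calls))
```

The second call returns the same text without calling the fetch function again, which is the behavior you would expect from skipping the update step.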
