# CSSE COVID-19 Dataset

## Daily reports (csse_covid_19_daily_reports)

This folder contains daily case reports. All timestamps are in UTC (GMT+0).

### File naming convention
MM-DD-YYYY.csv in UTC.
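
As a minimal sketch of this convention, the helper below builds the file name for a given moment by converting it to UTC first. The function name `daily_report_filename` is hypothetical and not part of the repository, and treating naive datetimes as UTC is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone


def daily_report_filename(moment: datetime) -> str:
    """Return the MM-DD-YYYY.csv name for the UTC date of `moment`."""
    # Assumption: a datetime without tzinfo is already expressed in UTC.
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%m-%d-%Y") + ".csv"


# 07:30 on Feb 14 in GMT+8 is still Feb 13 in UTC.
local = datetime(2020, 2, 14, 7, 30, tzinfo=timezone(timedelta(hours=8)))
print(daily_report_filename(local))  # 02-13-2020.csv
```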

### Field description
* Province/State: China - province name; US/Canada/Australia - city name, state/province name; Others - name of the event (e.g., "Diamond Princess" cruise ship); other countries - blank.
* Country/Region: country/region name conforming to WHO (will be updated).
* Last Update: MM/DD/YYYY HH:mm  (24 hour format, in UTC).
* Confirmed: the number of confirmed cases. For Hubei Province: from Feb 13 (GMT +8), we report both clinically diagnosed and lab-confirmed cases. For lab-confirmed cases only (before Feb 17), please refer to [who_covid_19_situation_reports](https://github.com/CSSEGISandData/COVID-19/tree/master/who_covid_19_situation_reports). For Italy, the diagnosis standard may have changed since Feb 27 to "slow the growth of new case numbers." ([Source](https://apnews.com/6c7e40fbec09858a3b4dbd65fe0f14f5))
* Deaths: the number of deaths.
* Recovered: the number of recovered cases.
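
The "Last Update" format above can be parsed with the standard library. This is a sketch only: the helper name `parse_last_update` and the sample value are made up for illustration, and a file that deviates from the documented format would raise `ValueError` here.

```python
from datetime import datetime, timezone

# MM/DD/YYYY HH:mm on a 24-hour clock, as documented for "Last Update".
LAST_UPDATE_FORMAT = "%m/%d/%Y %H:%M"


def parse_last_update(text: str) -> datetime:
    """Parse a "Last Update" value and mark it as UTC."""
    parsed = datetime.strptime(text.strip(), LAST_UPDATE_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


# Illustrative value, not copied from a real report.
stamp = parse_last_update("02/13/2020 23:59")
print(stamp.isoformat())  # 2020-02-13T23:59:00+00:00
```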

### Update frequency
* Files after Feb 1 (UTC): once a day around 23:59 (UTC).
* Files on and before Feb 1 (UTC): the last updated files before 23:59 (UTC). Sources: [archived_data](https://github.com/CSSEGISandData/COVID-19/tree/master/archived_data) and dashboard.

### Data sources
Refer to the [mainpage](https://github.com/CSSEGISandData/COVID-19).

### Why create this new folder?
1. Unifying all timestamps to UTC, including the file name and the "Last Update" field.
2. Pushing only one file every day.
3. All historic data is archived in [archived_data](https://github.com/CSSEGISandData/COVID-19/tree/master/archived_data).

---
## Time series summary (csse_covid_19_time_series)

This folder contains daily time series summary tables, including confirmed, deaths and recovered. All data are from the daily case report.

### Field description
* Province/State: same as above.
* Country/Region: same as above.
* Lat and Long: a coordinates reference for the user.
* Date fields: M/DD/YYYY (UTC), holding the same data as the corresponding MM-DD-YYYY.csv daily report.
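
For analysis it is often easier to turn the date columns into rows. The sketch below assumes headers with a two-digit year such as `3/29/20`, which is how they appear in the global time series files loaded later in this document; adjust the format string if a file uses a four-digit year. The four values are copied from the `3/29/20` and `3/30/20` columns for Afghanistan and Albania.

```python
import pandas as pd

# Two rows in the wide time series layout.
wide = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["Afghanistan", "Albania"],
    "Lat": [33.0, 41.1533],
    "Long": [65.0, 20.1683],
    "3/29/20": [120, 212],
    "3/30/20": [170, 223],
})

# Everything that is not an identifier column is treated as a date column.
id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
long = wide.melt(id_vars=id_cols, var_name="Date", value_name="Confirmed")

# Parse the header strings into real timestamps.
long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")
print(long[["Country/Region", "Date", "Confirmed"]])
```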

### Update frequency
* Once a day.

---
## Data modification records
We are also monitoring the curve change. Any errors made by us will be corrected in the dataset. Any possible errors from the original data sources will be listed here as a reference.
* NHC 2/14: Hubei Province deducted 108 prior deaths from the death toll due to double counting.
* About DP 3/1: All cases of COVID-19 in repatriated US citizens from the Diamond Princess are grouped together, and their location is currently designated at the ship’s port location off the coast of Japan. These individuals have been assigned to various quarantine locations (in military bases and hospitals) around the US. This grouping is consistent with the CDC.

---
## UID Lookup Table Logic

1.	All countries without dependencies (entries with only Admin0).
  *	Non-cruise-ship Admin0: UID = code3. (e.g., Afghanistan, UID = code3 = 4)
  *	Cruise ships in Admin0: Diamond Princess UID = 9999, MS Zaandam UID = 8888.
2.	All countries with only state-level dependencies (entries with Admin0 and Admin1).
  *	Denmark, France, Netherlands: mother countries and their dependencies have different code3s, therefore UID = code3. (e.g., Faroe Islands, Denmark, UID = code3 = 234; Denmark UID = 208)
  *	United Kingdom: the mother country and dependencies have different code3s, therefore UID = code3. One exception: Channel Islands uses the same code3 as the mother country (826), so it is given an artificial UID = 8261.
  *	Australia: alphabetically ordered all states, and their UIDs are from 3601 to 3608. Australia itself is 36.
  *	Canada: alphabetically ordered all provinces (including cruise ships and recovered entry), and their UIDs are from 12401 to 12415. Canada itself is 124.
  *	China: alphabetically ordered all provinces, and their UIDs are from 15601 to 15631. China itself is 156. Hong Kong and Macau have their own code3.
3.	The US (most entries with Admin0, Admin1 and Admin2).
  *	US by itself is 840 (UID = code3).
  *	US dependencies, American Samoa, Guam, Northern Mariana Islands, Virgin Islands and Puerto Rico, UID = code3. Their FIPS codes are different from code3.
  *	US states: UID = 840 (country code3) + 000XX (state FIPS code). Ranging from 84000001 to 84000056.
  *	Out of [State], US: UID = 840 (country code3) + 800XX (state FIPS code). Ranging from 84080001 to 84080056.
  *	Unassigned, US: UID = 840 (country code3) + 900XX (state FIPS code). Ranging from 84090001 to 84090056.
  *	US counties: UID = 840 (country code3) + XXXXX (5-digit FIPS code).
  *	Exception type 1, such as recovered and Kansas City, ranging from 8407001 to 8407999.
  *	Exception type 2, New York City only, which replaces New York County and its FIPS code.
  *	Exception type 3, Diamond Princess, US: 84088888; Grand Princess, US: 84099999.
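
The regular (non-exception) rules above can be sketched in a few lines. The function names `us_uid` and `admin1_uid` are hypothetical, and the alphabetical rank passed to `admin1_uid` has to be supplied by the caller; none of this is code from the repository.

```python
US_CODE3 = 840  # code3 of the US, used as the UID prefix

# Offset added between the country prefix and the FIPS code for each entry type.
_OFFSETS = {"state": 0, "out_of_state": 80000, "unassigned": 90000, "county": 0}


def us_uid(fips: int, kind: str = "county") -> int:
    """Build a US UID: 840 followed by a 5-digit field derived from the FIPS code."""
    if kind not in _OFFSETS:
        raise ValueError(f"unknown kind: {kind!r}")
    # State-level entries use a 2-digit FIPS code, counties a 5-digit one.
    limit = 99999 if kind == "county" else 99
    if not 0 < fips <= limit:
        raise ValueError(f"FIPS code {fips} out of range for {kind}")
    return US_CODE3 * 100_000 + _OFFSETS[kind] + fips


def admin1_uid(code3: int, rank: int) -> int:
    """UID of the rank-th (1-based, alphabetical) state or province of Australia, Canada or China."""
    return code3 * 100 + rank


print(us_uid(56, "state"))         # 84000056
print(us_uid(1, "out_of_state"))   # 84080001
print(us_uid(56, "unassigned"))    # 84090056
print(us_uid(6037))                # 84006037
print(admin1_uid(36, 8))           # 3608
```

The exception types (recovered and Kansas City entries, New York City, and the cruise ships) do not follow these formulas and would need an explicit lookup.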



In [43]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [44]:
# Folder that holds the time series CSVs; adjust to your local copy of the repository
data_dir = r"C:\Users\Himanshu Agarwal\Desktop\Python programs\ML using Python Microsoft\csse_covid_19_data\csse_covid_19_data\csse_covid_19_time_series"
raw_data_confirmed = pd.read_csv(data_dir + r"\time_series_covid19_confirmed_global.csv")
raw_data_deaths = pd.read_csv(data_dir + r"\time_series_covid19_deaths_global.csv")
raw_data_recovered = pd.read_csv(data_dir + r"\time_series_covid19_recovered_global.csv")


In [45]:
print(raw_data_confirmed.shape)

(263, 81)


In [46]:
print(raw_data_deaths.shape)

(263, 81)


In [47]:
print(raw_data_recovered.shape)

(249, 81)


## Total confirmed cases

In [48]:
confirm_cases = raw_data_confirmed.copy()
print(confirm_cases.head())

  Province/State Country/Region      Lat     Long  1/22/20  1/23/20  1/24/20  \
0            NaN    Afghanistan  33.0000  65.0000        0        0        0   
1            NaN        Albania  41.1533  20.1683        0        0        0   
2            NaN        Algeria  28.0339   1.6596        0        0        0   
3            NaN        Andorra  42.5063   1.5218        0        0        0   
4            NaN         Angola -11.2027  17.8739        0        0        0   

   1/25/20  1/26/20  1/27/20  ...  3/29/20  3/30/20  3/31/20  4/1/20  4/2/20  \
0        0        0        0  ...      120      170      174     237     273   
1        0        0        0  ...      212      223      243     259     277   
2        0        0        0  ...      511      584      716     847     986   
3        0        0        0  ...      334      370      376     390     428   
4        0        0        0  ...        7        7        7       8       8   

   4/3/20  4/4/20  4/5/20  4/6/20  4/7

In [49]:
confirm_cases.set_index('Country/Region',inplace=True)

In [50]:
print(confirm_cases.head())

               Province/State      Lat     Long  1/22/20  1/23/20  1/24/20  \
Country/Region                                                               
Afghanistan               NaN  33.0000  65.0000        0        0        0   
Albania                   NaN  41.1533  20.1683        0        0        0   
Algeria                   NaN  28.0339   1.6596        0        0        0   
Andorra                   NaN  42.5063   1.5218        0        0        0   
Angola                    NaN -11.2027  17.8739        0        0        0   

                1/25/20  1/26/20  1/27/20  1/28/20  ...  3/29/20  3/30/20  \
Country/Region                                      ...                     
Afghanistan           0        0        0        0  ...      120      170   
Albania               0        0        0        0  ...      212      223   
Algeria               0        0        0        0  ...      511      584   
Andorra               0        0        0        0  ...      334    

In [51]:
confirm_cases.sort_values('Country/Region',inplace=True)

In [52]:
confirm_cases.head()

Country/Region,Province/State,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,...,3/29/20,3/30/20,3/31/20,4/1/20,4/2/20,4/3/20,4/4/20,4/5/20,4/6/20,4/7/20
Afghanistan,,33.0,65.0,0,0,0,0,0,0,0,...,120,170,174,237,273,281,299,349,367,423
Albania,,41.1533,20.1683,0,0,0,0,0,0,0,...,212,223,243,259,277,304,333,361,377,383
Algeria,,28.0339,1.6596,0,0,0,0,0,0,0,...,511,584,716,847,986,1171,1251,1320,1423,1468
Andorra,,42.5063,1.5218,0,0,0,0,0,0,0,...,334,370,376,390,428,439,466,501,525,545
Angola,,-11.2027,17.8739,0,0,0,0,0,0,0,...,7,7,7,8,8,8,10,14,16,17


In [53]:
# Drop rows whose column values exactly repeat an earlier row (the index is not compared)
confirm_cases.drop_duplicates(keep='first',inplace=True)

In [54]:
confirm_cases.shape

(263, 80)

In [55]:
confirm_cases.isnull().sum()

Province/State    181
Lat                 0
Long                0
1/22/20             0
1/23/20             0
                 ... 
4/3/20              0
4/4/20              0
4/5/20              0
4/6/20              0
4/7/20              0
Length: 80, dtype: int64

In [57]:
# After cleaning, keep only the date columns so we can look at the case counts per country

confirm_cases.drop(['Province/State','Lat','Long'],axis=1,inplace=True)
confirm_cases.head()

Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,...,3/29/20,3/30/20,3/31/20,4/1/20,4/2/20,4/3/20,4/4/20,4/5/20,4/6/20,4/7/20
Afghanistan,0,0,0,0,0,0,0,0,0,0,...,120,170,174,237,273,281,299,349,367,423
Albania,0,0,0,0,0,0,0,0,0,0,...,212,223,243,259,277,304,333,361,377,383
Algeria,0,0,0,0,0,0,0,0,0,0,...,511,584,716,847,986,1171,1251,1320,1423,1468
Andorra,0,0,0,0,0,0,0,0,0,0,...,334,370,376,390,428,439,466,501,525,545
Angola,0,0,0,0,0,0,0,0,0,0,...,7,7,7,8,8,8,10,14,16,17


In [58]:
confirm_cases_T = confirm_cases.T
confirm_cases_T.head()

Country/Region,Afghanistan,Albania,Algeria,Andorra,Angola,Antigua and Barbuda,Argentina,Armenia,Australia,Australia.1,...,United Kingdom,United Kingdom.1,Uruguay,Uzbekistan,Venezuela,Vietnam,West Bank and Gaza,Western Sahara,Zambia,Zimbabwe
1/22/20,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1/23/20,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,0,0
1/24/20,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,0,0
1/25/20,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,0,0
1/26/20,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,2,0,0,0,0


In [59]:
confirm_cases_T.sum()

Country/Region
Afghanistan            3454
Albania                4526
Algeria               13294
Andorra                6081
Angola                  129
                      ...  
Vietnam                4593
West Bank and Gaza     2813
Western Sahara           12
Zambia                  464
Zimbabwe                118
Length: 263, dtype: int64

In [62]:
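# Note: the counts in each date column are cumulative, so this sum adds up every
# day's running total and is not the number of cases per country (Afghanistan
# shows 3454 here but 423 in the 4/7/20 column). The latest date column holds
# the current total. "Length: 263" also shows one entry per province-level row,
# so countries reported by province (e.g. Australia) still appear more than once.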
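
A minimal sketch of one way to get per-country totals, assuming a frame shaped like `confirm_cases` after `Province/State`, `Lat` and `Long` have been dropped. The toy countries "A" and "B" and their numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame: country "A" is reported as two province rows, "B" as one row.
cases = pd.DataFrame(
    {"4/5/20": [10, 3, 5], "4/6/20": [12, 4, 7], "4/7/20": [15, 4, 9]},
    index=pd.Index(["A", "A", "B"], name="Country/Region"),
)

# Merge province rows so that each country has exactly one row.
by_country = cases.groupby(level="Country/Region").sum()

# The counts are cumulative, so the latest column is the total to date.
total_per_country = by_country.iloc[:, -1]

# Daily new cases are the day-to-day differences; the first day has no
# previous value to compare with, so it is dropped.
new_cases = by_country.T.diff().iloc[1:]

print(total_per_country)
print(new_cases)
```

Taking the last column rather than summing across dates keeps the result a case count; summing is only meaningful on the differenced (daily) series.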