# Coronavirus Data Modeling


### Background
From Wikipedia...

"The 2019–20 coronavirus pandemic is an ongoing global pandemic of coronavirus disease 2019 (COVID-19) caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus was first reported in Wuhan, Hubei, China, in December 2019.[5][6] On March 11, 2020, the World Health Organization declared the outbreak a pandemic.[7] As of March 12, 2020, over 134,000 cases have been confirmed in more than 120 countries and territories, with major outbreaks in mainland China, Italy, South Korea, and Iran.[3] Around 5,000 people, with about 3200 from China, have died from the disease. More than 69,000 have recovered.[4]

The virus spreads between people in a way similar to influenza, via respiratory droplets from coughing.[8][9][10] The time between exposure and symptom onset is typically five days, but may range from two to fourteen days.[10][11] Symptoms are most often fever, cough, and shortness of breath.[10][11] Complications may include pneumonia and acute respiratory distress syndrome. There is currently no vaccine or specific antiviral treatment, but research is ongoing. Efforts are aimed at managing symptoms and supportive therapy. Recommended preventive measures include handwashing, maintaining distance from other people (particularly those who are sick), and monitoring and self-isolation for fourteen days for people who suspect they are infected.[9][10][12]

Public health responses around the world have included travel restrictions, quarantines, curfews, event cancellations, and school closures. They have included the quarantine of all of Italy and the Chinese province of Hubei; various curfew measures in China and South Korea;[13][14][15] screening methods at airports and train stations;[16] and travel advisories regarding regions with community transmission.[17][18][19][20] Schools have closed nationwide in 22 countries or locally in 17 countries, affecting more than 370 million students.[21]"

https://en.wikipedia.org/wiki/2019–20_coronavirus_pandemic 

For ADDITIONAL BACKGROUND, see JHU's COVID-19 Resource Center:
https://coronavirus.jhu.edu/




### Visualizations
Bias in Data
https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca

As INSPIRATION, there is the now-famous JHU CSSE dashboard:
https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

https://informationisbeautiful.net/visualizations/covid-19-coronavirus-infographic-datapack/

### Python


### R
A great starting point for data analytics CODING is the work of Tim Churches (thru 10 Mar): 
* "COVID-19 epidemiology with R"
https://rviews.rstudio.com/2020/03/05/covid-19-epidemiology-with-r/
* "Analysing COVID-19 (2019-nCoV) outbreak data with R - part 1"
http://bit.ly/2xvevpT
* "Analysing COVID-19 (2019-nCoV) outbreak data with R - part 2"
http://bit.ly/2Qdii1B
* "Modelling the effects of public health interventions on COVID-19 transmission using R - part 1"
http://bit.ly/2TXlBv2

Tim's posts serve as tutorials on the use of several epidemiological R PACKAGES from the R Epidemics Consortium:
https://www.repidemicsconsortium.org/projects/

### DATASETS
* The "mother lode" is JHU CSSE's curated repository, which is updated with global data multiple times a day and forms the basis for the JHU CSSE dashboard:
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data

* Eric Brown is curating a global outbreak dataset extracted from WHO situation reports. It is hosted on GitHub and formatted as an R package for easy installation; the data is available as an Rda file:
https://github.com/eebrown/data2019nCoV

* Kaggle Novel Corona Virus 2019 Dataset: Day level information on covid-19 affected cases
https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset

Finally, as motivation, see these various CHALLENGES and CfPs:
* March 2020 R Consortium ISC CfP:
https://www.r-consortium.org/blog/2020/03/11/march-2020-isc-call-for-proposals

* NSF 20-052: Dear Colleague Letter on the Coronavirus Disease 2019 (COVID-19)
https://www.nsf.gov/pubs/2020/nsf20052/nsf20052.jsp?org=NSF

* Kaggle 
https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset

(Thanks to John Erikson for putting together R/Dataset resources)

### The Assignment

Our lives have been seriously disrupted by the coronavirus pandemic, and there is every indication that this is a global event that will require collaboration across a global community to solve. Studying the data provides an opportunity to connect the pandemic to a variety of themes from the class.

Several groups are already examining this data, for example Our World in Data:
https://ourworldindata.org/coronavirus-source-data


1. Discussion.  What is the role of open data?  Why is it important in this case?

In [0]:
###Answer here.
q1="""

"""

2. What is the role of bias in the data?  Identify 3 different ways that the data could be biased.  

In [0]:
###Answer here. 
q2="""

"""

In [0]:
#Load some data
import pandas as pd
df=pd.read_csv('http://cowid.netlify.com/data/full_data.csv')
df

Unnamed: 0,date,location,new_cases,new_deaths,total_cases,total_deaths
0,2020-02-25,Afghanistan,,,1,
1,2020-02-26,Afghanistan,0.0,,1,
2,2020-02-27,Afghanistan,0.0,,1,
3,2020-02-28,Afghanistan,0.0,,1,
4,2020-02-29,Afghanistan,0.0,,1,
...,...,...,...,...,...,...
2140,2020-03-08,Worldwide,3644.0,96.0,105592,3584.0
2141,2020-03-09,Worldwide,3979.0,224.0,109577,3809.0
2142,2020-03-10,Worldwide,4119.0,201.0,113702,4012.0
2143,2020-03-11,Worldwide,4611.0,275.0,118319,4292.0
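
The `date` column arrives as plain strings when loaded this way. As a minimal sketch, assuming the column layout shown above, the snippet below parses dates at load time with `parse_dates`. It reads four rows copied from the output above through `io.StringIO` instead of the live URL, so it runs offline; `sample_csv` and `sample` are illustrative names, not part of the original notebook.

```python
import io
import pandas as pd

# Four rows copied from the output above, standing in for the full CSV.
sample_csv = """date,location,new_cases,new_deaths,total_cases,total_deaths
2020-02-25,Afghanistan,,,1,
2020-02-26,Afghanistan,0.0,,1,
2020-03-10,Worldwide,4119.0,201.0,113702,4012.0
2020-03-11,Worldwide,4611.0,275.0,118319,4292.0
"""

# parse_dates converts the date column to datetime64 instead of object (str).
sample = pd.read_csv(io.StringIO(sample_csv), parse_dates=["date"])
print(sample.dtypes)
```

The same `parse_dates=["date"]` argument should work when passed to the `pd.read_csv` call on the full file.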


### Preprocessing
Before analyzing the data, we have to deal with missing values.

Let's start by counting the missing values in each column.

In [0]:
df.isnull().sum() 

date               0
location           0
new_cases        119
new_deaths      1793
total_cases        0
total_deaths    1764
dtype: int64
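
Raw counts are easier to judge as a share of all rows. The sketch below redoes the arithmetic from the counts printed above. It assumes the frame has 2144 rows, inferred from the last displayed index (2143); `missing`, `n_rows`, and `share` are illustrative names.

```python
import pandas as pd

# Missing-value counts copied from the output above.
missing = pd.Series({
    "date": 0,
    "location": 0,
    "new_cases": 119,
    "new_deaths": 1793,
    "total_cases": 0,
    "total_deaths": 1764,
})

# Assumption: the index runs 0..2143, so the frame has 2144 rows.
n_rows = 2144

# Fraction of rows with a missing value, per column.
share = (missing / n_rows).round(3)
print(share)
```

On the live frame, `df.isnull().mean()` gives the same fractions directly. Under this assumption, more than 80% of rows lack a death count, which matters for the next step.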

Rather than filling NAs with 0s, let's be a little more conservative and drop every row that has an NA in any column.

In [0]:
df.dropna(inplace=True)
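
Calling `dropna()` with no arguments removes a row if any column is missing, which can discard far more data than intended. As a hedged sketch on four rows copied from the earlier output, the snippet below compares that behavior with `dropna(subset=...)`, which only looks at the named columns. The names `sample`, `dropped_any`, and `dropped_cases` are illustrative.

```python
import io
import pandas as pd

# Four rows copied from the earlier output; blanks become NaN.
sample_csv = """date,location,new_cases,new_deaths,total_cases,total_deaths
2020-02-25,Afghanistan,,,1,
2020-02-26,Afghanistan,0.0,,1,
2020-03-10,Worldwide,4119.0,201.0,113702,4012.0
2020-03-11,Worldwide,4611.0,275.0,118319,4292.0
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Drop a row if ANY column is missing (what df.dropna() does).
dropped_any = sample.dropna()

# Drop a row only if new_cases is missing; other gaps are tolerated.
dropped_cases = sample.dropna(subset=["new_cases"])

print(len(sample), len(dropped_any), len(dropped_cases))
```

In this sample, the plain `dropna()` removes both Afghanistan rows because their death counts are missing, while the `subset` version keeps one of them.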

In [0]:
df.isnull().sum() 

date            0
location        0
new_cases       0
new_deaths      0
total_cases     0
total_deaths    0
dtype: int64

In [0]:
df['location'].unique()

array(['Argentina', 'Australia', 'Canada', 'China', 'Egypt', 'France',
       'Germany', 'Indonesia', 'International', 'Iran', 'Iraq', 'Italy',
       'Japan', 'Lebanon', 'Morocco', 'Netherlands', 'Panama',
       'Philippines', 'South Korea', 'San Marino', 'Spain', 'Switzerland',
       'Thailand', 'United Kingdom', 'United States', 'Worldwide'],
      dtype=object)

In [0]:
df['date'].unique()

array(['2020-03-09', '2020-03-10', '2020-03-11', '2020-03-12',
       '2020-03-03', '2020-03-04', '2020-03-05', '2020-03-06',
       '2020-03-07', '2020-03-08', '2020-01-22', '2020-01-23',
       '2020-01-24', '2020-01-25', '2020-01-26', '2020-01-27',
       '2020-01-28', '2020-01-29', '2020-01-30', '2020-01-31',
       '2020-02-01', '2020-02-02', '2020-02-03', '2020-02-04',
       '2020-02-05', '2020-02-06', '2020-02-07', '2020-02-08',
       '2020-02-09', '2020-02-10', '2020-02-11', '2020-02-12',
       '2020-02-13', '2020-02-14', '2020-02-15', '2020-02-16',
       '2020-02-17', '2020-02-18', '2020-02-19', '2020-02-20',
       '2020-02-21', '2020-02-22', '2020-02-23', '2020-02-24',
       '2020-02-25', '2020-02-26', '2020-02-27', '2020-02-28',
       '2020-02-29', '2020-03-01', '2020-03-02', '2020-01-21'],
      dtype=object)
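
The dates above are strings listed in order of first appearance, not chronological order. A minimal sketch, using four of the values shown above, converts them with `pd.to_datetime` and sorts them to get the covered range; `dates`, `parsed`, and `ordered` are illustrative names.

```python
import pandas as pd

# Four date strings taken from the array above, in their original order.
dates = pd.Series(["2020-03-09", "2020-03-10", "2020-01-22", "2020-01-21"])

# Convert strings to datetime64 so comparisons are chronological.
parsed = pd.to_datetime(dates)

# Sort and renumber the index from 0.
ordered = parsed.sort_values().reset_index(drop=True)

print(ordered.min().date(), ordered.max().date())
```

On the full frame, the equivalent would be `pd.to_datetime(df['date'])` followed by `.min()` and `.max()`.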

In [0]:
pd.pivot_table(
    df,
    index='location',
    values=['new_cases', 'new_deaths', 'total_cases', 'total_deaths'],
    aggfunc='max',
).sort_values('total_deaths', ascending=False)

Unnamed: 0_level_0,new_cases,new_deaths,total_cases,total_deaths
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Worldwide,19572.0,317.0,125048,4613.0
China,19461.0,254.0,80981,3173.0
Italy,2313.0,196.0,12462,827.0
Iran,1234.0,63.0,9000,354.0
South Korea,813.0,7.0,7869,66.0
France,495.0,15.0,2269,48.0
Spain,615.0,18.0,2140,48.0
United States,291.0,8.0,987,29.0
Japan,59.0,3.0,620,15.0
International,61.0,2.0,706,7.0
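
The pivot table above takes the per-location maximum of each column. For the cumulative columns (`total_cases`, `total_deaths`), the maximum is the latest total only if the totals never decrease. As a sketch on a few rows copied from the earlier output, restricted to two columns for brevity, the snippet below shows that a `groupby(...).max()` produces the same numbers; `sample`, `cols`, `by_pivot`, and `by_group` are illustrative names.

```python
import io
import pandas as pd

# Rows copied from the earlier output, limited to two numeric columns.
sample_csv = """date,location,new_cases,total_cases
2020-02-25,Afghanistan,,1
2020-02-26,Afghanistan,0.0,1
2020-03-10,Worldwide,4119.0,113702
2020-03-11,Worldwide,4611.0,118319
"""
sample = pd.read_csv(io.StringIO(sample_csv))

cols = ["new_cases", "total_cases"]

# Per-location maximum via pivot_table, as in the cell above.
by_pivot = pd.pivot_table(sample, index="location", values=cols, aggfunc="max")

# The same aggregation written as a groupby; NaN values are skipped by max().
by_group = sample.groupby("location")[cols].max()

print(by_group.sort_values("total_cases", ascending=False))
```

Either form works; `groupby` is often easier to extend when different columns need different aggregations.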


In [0]:
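# sort_values needs a DataFrame instance and a column to sort by; total_cases is an assumed choice
df.sort_values('total_cases', ascending=False)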