Data Science Pipeline on Coronavirus statistics up until May 12th, 2020
This tutorial walks through the entire data science pipeline: data curation, parsing, and management; exploratory data analysis; and hypothesis testing and machine learning. It then analyzes the results and presents them in a way that supports inferences and predictions about the collected data.

<b> Presented by: Loc Cao, Ethan Tran </b>

The dataset we have chosen to analyze is the COVID-19 dataset maintained by Our World in Data, which can be found here:
https://ourworldindata.org/coronavirus-source-data

<b> This tutorial will be split into 3 sections: </b>
##### 1. Data Curation, Parsing, and Management
##### 2. Exploratory Data Analysis (EDA)
##### 3. Hypothesis Testing, Machine Learning, and Analysis

#### Section 1 - Data Curation, Parsing, and Management

Data curation, parsing, and management is the initial step in the data science process. The data we acquire is sometimes incomplete or poorly formatted. Such data is not considered 'tidy', and the process of reshaping it into an organized form is called tidying.

In [41]:
import pandas as pd

In [42]:
# Load a local copy of the Our World in Data COVID-19 dataset and preview the first rows.
df = pd.read_csv('data/owid-covid-data.csv')
df.head()

Unnamed: 0,iso_code,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,...,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_100k
0,ABW,Aruba,2020-03-13,2,2,0,0,18.733,18.733,0.0,...,13.085,7.452,35973.781,,,11.62,,,,
1,ABW,Aruba,2020-03-20,4,2,0,0,37.465,18.733,0.0,...,13.085,7.452,35973.781,,,11.62,,,,
2,ABW,Aruba,2020-03-24,12,8,0,0,112.395,74.93,0.0,...,13.085,7.452,35973.781,,,11.62,,,,
3,ABW,Aruba,2020-03-25,17,5,0,0,159.227,46.831,0.0,...,13.085,7.452,35973.781,,,11.62,,,,
4,ABW,Aruba,2020-03-26,19,2,0,0,177.959,18.733,0.0,...,13.085,7.452,35973.781,,,11.62,,,,


Many of the countries in this dataset are missing essential statistics on the coronavirus, possibly because of a lack of information or underreporting. Whatever the reason, columns that are missing a significant number of entries (NaNs) are dropped because they cannot provide meaningful insights.
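
As a minimal sketch of how such sparse columns could be identified before dropping them, the snippet below measures the fraction of NaNs per column on a tiny made-up frame. The sample values, the `MISSING_THRESHOLD` cutoff of 0.5, and the variable names are assumptions chosen for illustration; they are not taken from the dataset or from the choice of columns dropped in this tutorial.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real data (values are illustrative only).
sample = pd.DataFrame({
    "location": ["Aruba", "Aruba", "World", "World"],
    "total_cases": [2, 4, 10, 12],
    "total_tests": [np.nan, np.nan, np.nan, 100.0],
    "handwashing_facilities": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values in each column, largest first.
missing_fraction = sample.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: call a column too sparse if more than half its entries are NaN.
MISSING_THRESHOLD = 0.5
sparse_columns = missing_fraction[missing_fraction > MISSING_THRESHOLD].index.tolist()

# Drop the sparse columns, returning a new frame instead of mutating the original.
trimmed = sample.drop(columns=sparse_columns)

print(missing_fraction)
print(trimmed.columns.tolist())
```

Running the same `isna().mean()` check on the full DataFrame gives a quick ranking of which columns are candidates for removal; the final decision still depends on which attributes the analysis needs.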

In [43]:
# Drop the columns that are missing a significant number of entries.
df.drop(['total_tests', 'new_tests', 'total_tests_per_thousand', 'new_tests_per_thousand',
         'tests_units', 'handwashing_facilities'], axis=1, inplace=True)
df.head()

Unnamed: 0,iso_code,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,...,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_100k
0,ABW,Aruba,2020-03-13,2,2,0,0,18.733,18.733,0.0,...,41.2,13.085,7.452,35973.781,,,11.62,,,
1,ABW,Aruba,2020-03-20,4,2,0,0,37.465,18.733,0.0,...,41.2,13.085,7.452,35973.781,,,11.62,,,
2,ABW,Aruba,2020-03-24,12,8,0,0,112.395,74.93,0.0,...,41.2,13.085,7.452,35973.781,,,11.62,,,
3,ABW,Aruba,2020-03-25,17,5,0,0,159.227,46.831,0.0,...,41.2,13.085,7.452,35973.781,,,11.62,,,
4,ABW,Aruba,2020-03-26,19,2,0,0,177.959,18.733,0.0,...,41.2,13.085,7.452,35973.781,,,11.62,,,


Additionally, the last few entries contained ambiguous values that are not described on the website: every attribute in those rows is either 0 or NaN. We removed those rows as well.
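
One hedged way to locate rows like these without relying on a hard-coded position is to flag every row whose numeric attributes are all 0 or NaN. The sketch below uses a small made-up frame; it assumes that only the numeric columns matter for the check, which may not match how the affected rows look in the real file.

```python
import numpy as np
import pandas as pd

# Small made-up frame: the last two rows mimic entries holding only 0s or NaNs.
sample = pd.DataFrame({
    "location": ["World", "World", "Unknown", "Unknown"],
    "total_cases": [10, 12, 0, np.nan],
    "new_cases": [0, 2, np.nan, 0],
})

# Only numeric columns are checked (an assumption of this sketch).
numeric = sample.select_dtypes(include="number")

# A row is flagged when every numeric value in it is either NaN or exactly 0.
is_empty_row = (numeric.isna() | numeric.eq(0)).all(axis=1)

# Keep the rows that carry at least one non-zero, non-missing value.
cleaned = sample.loc[~is_empty_row].reset_index(drop=True)

print(is_empty_row.tolist())
print(cleaned)
```

This mask flags any matching row, not only trailing ones, so it is worth inspecting the flagged rows before dropping them: a legitimate early record with no cases yet would also match.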

In [45]:
# Keep only the first 16737 rows; the rows after them hold only 0s or NaNs.
df = df.iloc[:16737]
df

Unnamed: 0,iso_code,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,...,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,hospital_beds_per_100k
0,ABW,Aruba,2020-03-13,2,2,0,0,18.733,18.733,0.000,...,41.2,13.085,7.452,35973.781,,,11.62,,,
1,ABW,Aruba,2020-03-20,4,2,0,0,37.465,18.733,0.000,...,41.2,13.085,7.452,35973.781,,,11.62,,,
2,ABW,Aruba,2020-03-24,12,8,0,0,112.395,74.930,0.000,...,41.2,13.085,7.452,35973.781,,,11.62,,,
3,ABW,Aruba,2020-03-25,17,5,0,0,159.227,46.831,0.000,...,41.2,13.085,7.452,35973.781,,,11.62,,,
4,ABW,Aruba,2020-03-26,19,2,0,0,177.959,18.733,0.000,...,41.2,13.085,7.452,35973.781,,,11.62,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16732,OWID_WRL,World,2020-05-08,3809238,94422,269249,5748,488.690,12.113,34.542,...,30.9,8.696,5.355,15469.207,10.0,233.07,8.51,6.434,34.635,2.705
16733,OWID_WRL,World,2020-05-09,3899355,90117,274517,5268,500.251,11.561,35.218,...,30.9,8.696,5.355,15469.207,10.0,233.07,8.51,6.434,34.635,2.705
16734,OWID_WRL,World,2020-05-10,3986907,87552,278957,4440,511.483,11.232,35.788,...,30.9,8.696,5.355,15469.207,10.0,233.07,8.51,6.434,34.635,2.705
16735,OWID_WRL,World,2020-05-11,4066549,79642,282367,3410,521.700,10.217,36.225,...,30.9,8.696,5.355,15469.207,10.0,233.07,8.51,6.434,34.635,2.705


The final tidy dataset contains 16737 entries and 23 attributes. Note that some of these attributes are not analyzed in this tutorial; they are kept because they are mostly complete and could be useful for a separate analysis later.

#### Section 2 - Exploratory Data Analysis

In [None]:
#TODO

#### Section 3 - Hypothesis Testing & Machine Learning + Analysis

In [46]:
#TODO