<a href="https://colab.research.google.com/github/Elzawawy/covid-case-estimator/blob/master/Our_World_In_Data_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Our World In Data Dataset Exploration

Check out the [dataset website](https://ourworldindata.org/coronavirus).

The data is also available in this [GitHub repository](https://github.com/owid/covid-19-data).

The dataset has 207 **country profiles** which allow you to explore the statistics on the coronavirus pandemic for every country in the world. Every country profile is updated daily. Every profile includes **four sections**:

*  How many people have died from the coronavirus?
*  How much testing for coronavirus do countries conduct? 
*  How many cases were confirmed?
*  What measures did countries take in response to the pandemic?



In [2]:
#imports cell
import pandas as pd
import numpy as np
import pickle
from shutil import copyfile

# mount google drive to copy files from repo into drive.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Download the dataset
The dataset can be downloaded either from the **website download URL** or from the GitHub repository file URL.
After downloading the CSV file, I copy it to permanent storage on Google Drive for future use.

In [3]:
!wget -O owid-covid-data.csv https://covid.ourworldindata.org/data/owid-covid-data.csv
OWID_COVID_DATA_FILE = "/content/owid-covid-data.csv"
STORAGE_DIR = "/content/drive/My Drive/COVID-19/our-world-in-data/"
copyfile(OWID_COVID_DATA_FILE, STORAGE_DIR+"owid-covid-data.csv");

--2020-05-16 06:49:44--  https://covid.ourworldindata.org/data/owid-covid-data.csv
Resolving covid.ourworldindata.org (covid.ourworldindata.org)... 104.248.63.231, 2604:a880:400:d1::89c:7001
Connecting to covid.ourworldindata.org (covid.ourworldindata.org)|104.248.63.231|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2488418 (2.4M) [text/csv]
Saving to: ‘owid-covid-data.csv’


2020-05-16 06:49:45 (3.30 MB/s) - ‘owid-covid-data.csv’ saved [2488418/2488418]
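The copy step above only works if the storage folder already exists on Drive. A minimal sketch of a safer copy, assuming a hypothetical helper named `copy_to_storage` (not part of the original notebook) and temporary paths standing in for `/content` and the Drive folder:

```python
import os
import tempfile
from shutil import copyfile


def copy_to_storage(source_file, storage_dir):
    """Copy source_file into storage_dir, creating the directory if needed."""
    # Hypothetical helper for illustration; the notebook calls copyfile directly.
    os.makedirs(storage_dir, exist_ok=True)
    destination = os.path.join(storage_dir, os.path.basename(source_file))
    copyfile(source_file, destination)
    return destination


# Demo with temporary paths instead of the real Colab and Drive locations.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "owid-covid-data.csv")
    with open(source, "w") as handle:
        handle.write("iso_code,location,date,total_cases\n")
    copied_path = copy_to_storage(
        source, os.path.join(tmp, "drive", "our-world-in-data")
    )
    copied_exists = os.path.exists(copied_path)
```

In the notebook itself, the same call would take `OWID_COVID_DATA_FILE` and `STORAGE_DIR` as arguments.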



## Understanding the dataset


In [4]:
owid_covid_dataframe = pd.read_csv(STORAGE_DIR+"owid-covid-data.csv")
owid_covid_dataframe.head()

Unnamed: 0,iso_code,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,total_deaths_per_million,new_deaths_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,tests_units,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_100k
0,ABW,Aruba,2020-03-13,2,2,0,0,18.733,18.733,0.0,0.0,,,,,,106766.0,584.8,41.2,13.085,7.452,35973.781,,,11.62,,,,
1,ABW,Aruba,2020-03-20,4,2,0,0,37.465,18.733,0.0,0.0,,,,,,106766.0,584.8,41.2,13.085,7.452,35973.781,,,11.62,,,,
2,ABW,Aruba,2020-03-24,12,8,0,0,112.395,74.93,0.0,0.0,,,,,,106766.0,584.8,41.2,13.085,7.452,35973.781,,,11.62,,,,
3,ABW,Aruba,2020-03-25,17,5,0,0,159.227,46.831,0.0,0.0,,,,,,106766.0,584.8,41.2,13.085,7.452,35973.781,,,11.62,,,,
4,ABW,Aruba,2020-03-26,19,2,0,0,177.959,18.733,0.0,0.0,,,,,,106766.0,584.8,41.2,13.085,7.452,35973.781,,,11.62,,,,


In [0]:
def create_daily_feature_dict(dataframe, feature):
  """Map each country to an array of [date, feature] rows sorted by date."""
  country_cases = {}
  countries = dataframe.location.unique()
  # Drop rows where the feature is missing, so countries without data are skipped.
  dataframe = dataframe.dropna(subset=[feature])
  for country in countries:
    country_rows = dataframe[dataframe['location'] == country].sort_values(by=['date'])
    dict_value = np.array(country_rows[['date', feature]])
    if dict_value.size != 0:
      country_cases[country] = dict_value
  return country_cases

def save_dict_to_pickle(data, pickle_file):
  """Serialize a dictionary to pickle_file."""
  with open(pickle_file, 'wb') as handle:
    pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)

### Create a New Confirmed Cases Dictionary
* **Key: Country**
* **Value: array(list(date,new_cases_count))**

Saved to `COVID-19/our-world-in-data/new_cases_dict.pickle`.

### Create a New Deaths Dictionary
* **Key: Country**
* **Value: array(list(date,new_deaths_count))**

Saved to `COVID-19/our-world-in-data/new_deaths_dict.pickle`.

### Create a New Tests Dictionary
* **Key: Country**
* **Value: array(list(date,new_tests_count))**

Saved to `COVID-19/our-world-in-data/new_tests_dict.pickle`.

In [0]:
country_daily_features = ['new_cases','new_deaths','new_tests']
for feature in country_daily_features:
  country_feature_dict = create_daily_feature_dict(owid_covid_dataframe,feature)
  save_dict_to_pickle(country_feature_dict, STORAGE_DIR+'{}_dict.pickle'.format(feature))
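To make the dictionary layout concrete, here is a small sketch on made-up rows (the values are invented, not taken from the dataset). It builds a country-to-array mapping of the same shape with a `groupby`, writes it to a temporary pickle file, loads it back, and turns one country's array into a date-indexed Series.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Toy rows standing in for the OWID CSV (values are made up for illustration).
toy = pd.DataFrame({
    "location": ["Aruba", "Aruba", "Egypt", "Egypt"],
    "date": ["2020-03-20", "2020-03-13", "2020-03-14", "2020-03-15"],
    "new_cases": [2.0, 2.0, np.nan, 16.0],
})

# Same shape as the notebook's dictionaries: country -> array of [date, value] rows.
daily = {
    country: group.sort_values("date")[["date", "new_cases"]].to_numpy()
    for country, group in toy.dropna(subset=["new_cases"]).groupby("location")
}

# Round trip through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "new_cases_dict.pickle")
    with open(path, "wb") as handle:
        pickle.dump(daily, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

# Turn one country's array back into a date-indexed Series.
aruba = pd.Series(loaded["Aruba"][:, 1], index=pd.to_datetime(loaded["Aruba"][:, 0]))
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.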

## Global Features For Each Country

* Next, we build the features needed for the **class 2 model** we are trying out: a Total Cases Model in which each data instance represents a country and its features, and the prediction label is the country's total number of cases.

* The Our World In Data dataset fits this case well because it has a large set of per-country features. Among the features it provides:

    1- **Total Number of Cases:** the total number of confirmed cases, whether active, recovered or dead.

    2- **Population:** the number of individuals in the country.

    3- **Population Density:** the average number of individuals per unit of area.

In [0]:
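def create_global_feature_dict(dataframe,feature):
  """Map each country to its most recent non-null value of the feature."""
  country_feature_dict = {}
  countries = dataframe.location.unique()
  # Sort by date so the last row kept per country is the most recent one.
  dataframe = dataframe.sort_values(by=['date'])
  for country in countries:
    dict_value = dataframe[dataframe['location'] == country][feature].dropna()
    if dict_value.size != 0:
      country_feature_dict[country] = dict_value.iloc[-1]
  return country_feature_dict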
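The same "latest non-null value per country" lookup can also be sketched without an explicit loop. This is an alternative under stated assumptions, not the notebook's method: the helper name `latest_value_per_country` is hypothetical, and the toy values are invented. It relies on pandas' `GroupBy.last()`, which skips missing values by default.

```python
import numpy as np
import pandas as pd


def latest_value_per_country(dataframe, feature):
    """Return {country: most recent non-null value of feature}."""
    latest = (
        dataframe.sort_values("date")
        .groupby("location")[feature]
        .last()    # last non-null value within each country
        .dropna()  # drop countries that never report the feature
    )
    return latest.to_dict()


# Toy rows (made up): Aruba's latest row is missing, Egypt has no value at all.
toy = pd.DataFrame({
    "location": ["Aruba", "Aruba", "Aruba", "Egypt"],
    "date": ["2020-03-13", "2020-03-20", "2020-03-24", "2020-03-14"],
    "total_tests": [10.0, 25.0, np.nan, np.nan],
})
latest_tests = latest_value_per_country(toy, "total_tests")
```

Here Aruba maps to its last reported value and Egypt is left out, which matches the behaviour of the loop version.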

In [0]:
country_features = ['total_cases','population','median_age','gdp_per_capita','hospital_beds_per_100k','total_deaths', 'total_tests']
for feature in country_features:
  country_feature_dict = create_global_feature_dict(owid_covid_dataframe,feature)
  save_dict_to_pickle(country_feature_dict, STORAGE_DIR+'country-features/{}_dict.pickle'.format(feature))
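Since each instance in the Total Cases Model is a country, the per-feature dictionaries eventually need to be combined into one table with a row per country. A minimal sketch of that step, assuming hypothetical dictionaries with invented numbers (the names `feature_dicts`, `country_table`, `X` and `y` are not from the notebook):

```python
import pandas as pd

# Hypothetical per-country dictionaries, shaped like the ones saved above
# (the numbers are invented for illustration).
feature_dicts = {
    "total_cases": {"Aruba": 101, "Egypt": 11228},
    "population": {"Aruba": 106766.0, "Egypt": 102334403.0},
    "median_age": {"Aruba": 41.2},
}

# One row per country, one column per feature; missing values become NaN.
country_table = pd.DataFrame(feature_dicts)
country_table.index.name = "location"

# Features are the model inputs, total_cases is the prediction label.
X = country_table.drop(columns=["total_cases"])
y = country_table["total_cases"]
```

Countries missing from one dictionary still get a row, so missing values have to be imputed or dropped before fitting a model.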

In [9]:
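# `df` is not defined in this notebook; this assumes the full OWID dataframe was meant.
owid_covid_dataframe.select_dtypes(include='number').corr()['total_cases']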
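A correlation check of this kind can be sketched on a small toy table. The values below are invented so that the result is easy to verify by eye. Selecting numeric columns first avoids errors from text columns such as `location`.

```python
import pandas as pd

# Toy per-country table (made-up values chosen to give exact correlations).
toy = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "total_cases": [10.0, 20.0, 30.0, 40.0],
    "population": [100.0, 200.0, 300.0, 400.0],
    "median_age": [40.0, 30.0, 20.0, 10.0],
})

# Pearson correlation of every numeric column with total_cases.
correlations = toy.select_dtypes(include="number").corr()["total_cases"]
```

In this toy table `population` rises in step with `total_cases` and `median_age` falls in step with it, so their correlations are 1 and -1 respectively.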