# Week 3 COVID-19 Prediction with InterpretML
This notebook describes an attempt at predicting the number of confirmed cases and fatalities for Week 3 of the COVID-19 Kaggle competition, using models built with the [InterpretML toolbox](https://github.com/interpretml/interpret).

## Data Sources & Collection
We're using data that was collected or scraped from various sources. Some of it comes from work already done by other people, who are credited where their data is used. The other data that we present (and append to the training data) was collected from multiple other sources, using the tools that can be seen on the GitHub page [here](). The sources featured in this notebook are:

1. [Worldometer Coronavirus page](https://www.worldometers.info/coronavirus/), which we believe contains the most up-to-date information on the number of confirmed cases and fatalities globally. As of 5 April, it has also been updated with the latest number of tests conducted globally; however, Worldometer itself does not yet provide a time series for all countries.
2. Global climate data from the [World Bank](https://datahelpdesk.worldbank.org/knowledgebase/articles/902061-climate-data-api). As explained a bit later in the notebook, we believe that a country's current climate conditions might have some effect on the spread of the virus.
3. [Our World in Data](https://ourworldindata.org/covid-testing), which provides a fairly up-to-date time series of the recorded COVID-19 tests conducted by many countries. Note, however, that because not all countries have released test data, only some countries could have their data included (and not by region).
4. [The COVID Tracking Project](https://covidtracking.com/), which specifically provides the COVID-19 testing data recorded so far in the US. We understand that this will mainly help in predicting the outcome for the US and its regions.

## Loading InterpretML
First, ensure that the InterpretML toolbox is installed with pip:

In [2]:
!pip install -U interpret



## 1. Appending the Training Dataset with Other Features
Now that InterpretML has been installed, let's first review the training and test data provided by default, to see which features could be extracted for later use.

In [3]:
import pandas as pd 
import numpy as np 

train_default_path = "./data/train.csv"
test_default_path = "./data/test.csv"

train_default_data = pd.read_csv(train_default_path)
train_default_data

Unnamed: 0,Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities
0,1,,Afghanistan,2020-01-22,0.0,0.0
1,2,,Afghanistan,2020-01-23,0.0,0.0
2,3,,Afghanistan,2020-01-24,0.0,0.0
3,4,,Afghanistan,2020-01-25,0.0,0.0
4,5,,Afghanistan,2020-01-26,0.0,0.0
...,...,...,...,...,...,...
23251,32707,,Zimbabwe,2020-04-02,9.0,1.0
23252,32708,,Zimbabwe,2020-04-03,9.0,1.0
23253,32709,,Zimbabwe,2020-04-04,9.0,1.0
23254,32710,,Zimbabwe,2020-04-05,9.0,1.0


In [4]:
test_default_data = pd.read_csv(test_default_path)
test_default_data

Unnamed: 0,ForecastId,Province_State,Country_Region,Date
0,1,,Afghanistan,2020-03-26
1,2,,Afghanistan,2020-03-27
2,3,,Afghanistan,2020-03-28
3,4,,Afghanistan,2020-03-29
4,5,,Afghanistan,2020-03-30
...,...,...,...,...
13153,13154,,Zimbabwe,2020-05-03
13154,13155,,Zimbabwe,2020-05-04
13155,13156,,Zimbabwe,2020-05-05
13156,13157,,Zimbabwe,2020-05-06


From these data, it can be seen that the previously known numbers of confirmed cases and fatalities are the main default features that could be extracted and used. Based on expert opinions as well as various other works, however, these features alone do not seem sufficient to accurately predict the future totals of confirmed cases and fatalities.

Hence, additional features are required. Several of the additional features that we've collected are described below:

### 1.a. Weather features
Thanks to the work by [davidbnn92](https://www.kaggle.com/davidbnn92/weather-data/output), a variation of the training data with weather/climate features appended for all regions is available. As noted on their page, these weather data are courtesy of [NOAA GSOD readings](https://www.kaggle.com/noaa/gsod).
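
As a rough illustration of what "appending" extra features to the training data can look like, the sketch below left-joins a small feature table onto a stand-in for `train.csv` using the region and date columns as keys. Everything here is an assumption made for illustration: the rows are invented toy values, the weather column names `temp` and `wdsp` are placeholders rather than the actual columns of the dataset above, and `add_features` is a hypothetical helper, not part of any library.

```python
import pandas as pd

# Toy stand-in for train.csv (same key columns; values are invented).
train = pd.DataFrame({
    "Province_State": [None, None, "X"],
    "Country_Region": ["Afghanistan", "Afghanistan", "Y"],
    "Date": ["2020-01-22", "2020-01-23", "2020-01-22"],
    "ConfirmedCases": [0.0, 0.0, 5.0],
    "Fatalities": [0.0, 0.0, 1.0],
})

# Hypothetical weather table; "temp" and "wdsp" are placeholder column names.
weather = pd.DataFrame({
    "Province_State": [None, "X"],
    "Country_Region": ["Afghanistan", "Y"],
    "Date": ["2020-01-22", "2020-01-22"],
    "temp": [42.6, 38.1],
    "wdsp": [9.4, 3.2],
})

keys = ["Province_State", "Country_Region", "Date"]


def add_features(base, extra, keys):
    """Left-join extra feature columns onto base, keeping every base row."""
    base = base.copy()
    extra = extra.copy()
    # Replace missing provinces with "" so the join keys compare predictably.
    base["Province_State"] = base["Province_State"].fillna("")
    extra["Province_State"] = extra["Province_State"].fillna("")
    # validate= raises an error if the feature table has duplicate keys,
    # which would otherwise silently duplicate training rows.
    return base.merge(extra, on=keys, how="left", validate="many_to_one")


train_weather = add_features(train, weather, keys)
print(train_weather)
```

A left join keeps every training row even when no extra feature is available for that region and date; those rows simply end up with missing values in the new columns, which then need to be handled before training.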




## Short Introduction to InterpretML
InterpretML is ... 

[TODO: summarize what InterpretML is, and provide some of the model examples that can be used from the InterpretML toolbox]

For this notebook, we'll create several models from the [InterpretML toolbox library](https://github.com/interpretml/interpret). These models will be trained on different sets of features (including the default features provided), and their performances will then be compared with each other.

### 1. Default Features
First, we'll create a model that is trained using only the default features provided by the training data.