# Week 3 COVID-19 Prediction with Interpret_ML
This notebook describes an attempt at predicting the number of confirmed cases and fatalities for the 3rd week of the COVID-19 Kaggle competition, using models created with the [Interpret_ML toolbox](https://github.com/interpretml/interpret).

## Data Sources & Collection
We use data collected or scraped from various sources. Some of it comes from work already done by other people, who are credited below. The rest (which will be appended to the training data) was collected from multiple other sources, using tools that can be seen on the GitHub page [here](). The sources featured in this notebook are:

1. [Worldometer Coronavirus page](https://www.worldometers.info/coronavirus/), which we believe contains the most up-to-date information on the number of confirmed cases and fatalities globally. As of 5 April it also reports the latest number of tests conducted globally; however, Worldometer itself does not yet provide a time series for all countries.
2. Global climate data from the [World Bank](https://datahelpdesk.worldbank.org/knowledgebase/articles/902061-climate-data-api). As explained later in the notebook, we believe that a country's current climate might have some effect on the spread of the virus.
3. [Our World in Data](https://ourworldindata.org/covid-testing), which provides a fairly up-to-date time series of the recorded COVID-19 tests conducted by many countries. Note, however, that because not all countries have released test data, only some countries can have test data added (and only at country level, not by region).
4. [The COVID Tracking Project](https://covidtracking.com/), which specifically provides the COVID-19 testing data recorded so far in the US. This will mainly help in predicting the outcome for the US and its regions.

## Short Introduction to InterpretML
[InterpretML](https://github.com/interpretml/interpret) is a machine learning toolbox developed by Microsoft Research, with the goal of making trained machine learning models more interpretable. For COVID-19 forecasting in particular, we believe this toolbox will give a better understanding of the relationship between the different features and the model's predictions, hopefully helping to answer some of the [scientific questions](https://www.kaggle.com/c/covid19-global-forecasting-week-4/overview/open-scientific-questions) regarding the factors that affect COVID-19 transmission.

[TODO: summarize what InterpretML is, and provide some of the model examples that can be used from the InterpretML toolbox]

For this notebook, we'll create several models from the [InterpretML toolbox library](https://github.com/interpretml/interpret). These models will be trained on different sets of features (including the default features provided), and their performance will then be compared against each other.

## Loading of Interpret_ML.
First ensure that the Interpret_ML toolbox is installed with pip   

In [18]:
!pip install -U interpret

Requirement already up-to-date: interpret in /home/nick_sadjoli/.virtualenvs/covid19_forecast/lib/python3.6/site-packages (0.1.21)


## 1. Appending Other Features to the Training Dataset
Now that Interpret_ML has been installed, let's first review and take note of the training and test data that has been provided by default, to see what features could be extracted for use later.

In [19]:
import pandas as pd 
import numpy as np 

train_default_path = "../input/train.csv"
test_default_path = "../input/test.csv"

train_default_data = pd.read_csv(train_default_path)
train_default_data

Unnamed: 0,Id,Country_Region,Province_State,Date,ConfirmedCases,Fatalities
0,1,Afghanistan,,2020-01-22,0.0,0.0
1,2,Afghanistan,,2020-01-23,0.0,0.0
2,3,Afghanistan,,2020-01-24,0.0,0.0
3,4,Afghanistan,,2020-01-25,0.0,0.0
4,5,Afghanistan,,2020-01-26,0.0,0.0
...,...,...,...,...,...,...
24409,35642,Zimbabwe,,2020-04-04,9.0,1.0
24410,35643,Zimbabwe,,2020-04-05,9.0,1.0
24411,35644,Zimbabwe,,2020-04-06,10.0,1.0
24412,35645,Zimbabwe,,2020-04-07,11.0,2.0


In [20]:
test_default_data = pd.read_csv(test_default_path)
test_default_data

Unnamed: 0,ForecastId,Country_Region,Province_State,Date
0,1,Afghanistan,,2020-04-02
1,2,Afghanistan,,2020-04-03
2,3,Afghanistan,,2020-04-04
3,4,Afghanistan,,2020-04-05
4,5,Afghanistan,,2020-04-06
...,...,...,...,...
13454,13455,Zimbabwe,,2020-05-10
13455,13456,Zimbabwe,,2020-05-11
13456,13457,Zimbabwe,,2020-05-12
13457,13458,Zimbabwe,,2020-05-13


Looking at these data, the previously known numbers of confirmed cases and fatalities are the main default features that can be extracted and used. Based on expert opinions as well as various other works, however, these features alone do not seem sufficient to accurately predict the total number of confirmed cases and fatalities in the future.

Hence, additional data features would be required. In this notebook, several of the additional data features that we've collected can be seen below:

### 1.a. Weather features
Thanks to the work by David Bonin (Kaggle user [davidbnn92](https://www.kaggle.com/davidbnn92)) in his [notebook](https://www.kaggle.com/davidbnn92/weather-data/output), a variation of the training data with weather/climate features appended for all regions is available. As noted on his page, the weather data come from [NOAA GSOD readings](https://www.kaggle.com/noaa/gsod).

In [21]:
train_appended_df = pd.read_csv("../input/training_data_with_weather_info_week_4.csv")
print("Current columns:", train_appended_df.columns)
train_appended_df

Current columns: Index(['Id', 'Country_Region', 'Province_State', 'Date', 'ConfirmedCases',
       'Fatalities', 'country+province', 'Lat', 'Long', 'day_from_jan_first',
       'temp', 'min', 'max', 'stp', 'slp', 'dewp', 'rh', 'ah', 'wdsp', 'prcp',
       'fog'],
      dtype='object')


Unnamed: 0,Id,Country_Region,Province_State,Date,ConfirmedCases,Fatalities,country+province,Lat,Long,day_from_jan_first,...,min,max,stp,slp,dewp,rh,ah,wdsp,prcp,fog
0,1,Afghanistan,,2020-01-22,0.0,0.0,Afghanistan-,33.000000,65.000000,22,...,33.6,54.9,999.9,1024.3,27.4,0.545709,0.186448,9.4,0.00,0
1,2,Afghanistan,,2020-01-23,0.0,0.0,Afghanistan-,33.000000,65.000000,23,...,32.7,55.9,999.9,1020.8,22.8,0.461259,0.163225,14.9,99.99,1
2,3,Afghanistan,,2020-01-24,0.0,0.0,Afghanistan-,33.000000,65.000000,24,...,36.9,43.2,999.9,1018.6,34.5,0.801794,0.325375,10.4,0.17,1
3,4,Afghanistan,,2020-01-25,0.0,0.0,Afghanistan-,33.000000,65.000000,25,...,37.9,56.3,999.9,1018.0,37.8,0.728175,0.214562,6.1,0.57,1
4,5,Afghanistan,,2020-01-26,0.0,0.0,Afghanistan-,33.000000,65.000000,26,...,36.1,53.1,999.9,1014.8,33.2,0.685513,0.231656,10.8,0.00,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24409,35642,Zimbabwe,,2020-04-04,9.0,1.0,Zimbabwe-,-17.829167,31.052222,95,...,66.2,80.6,999.9,,53.9,0.481730,0.130122,4.2,0.00,0
24410,35643,Zimbabwe,,2020-04-05,9.0,1.0,Zimbabwe-,-17.829167,31.052222,96,...,66.2,80.6,999.9,,53.9,0.481730,0.130122,4.2,0.00,0
24411,35644,Zimbabwe,,2020-04-06,10.0,1.0,Zimbabwe-,-17.829167,31.052222,97,...,66.2,80.6,999.9,,53.9,0.481730,0.130122,4.2,0.00,0
24412,35645,Zimbabwe,,2020-04-07,11.0,2.0,Zimbabwe-,-17.829167,31.052222,98,...,66.2,80.6,999.9,,53.9,0.481730,0.130122,4.2,0.00,0


In [22]:
training_data_unique_regions = train_appended_df['Province_State'].unique()
training_data_unique_regions, len(training_data_unique_regions)

(array([nan, 'Australian Capital Territory', 'New South Wales',
        'Northern Territory', 'Queensland', 'South Australia', 'Tasmania',
        'Victoria', 'Western Australia', 'Alberta', 'British Columbia',
        'Manitoba', 'New Brunswick', 'Newfoundland and Labrador',
        'Northwest Territories', 'Nova Scotia', 'Ontario',
        'Prince Edward Island', 'Quebec', 'Saskatchewan', 'Yukon', 'Anhui',
        'Beijing', 'Chongqing', 'Fujian', 'Gansu', 'Guangdong', 'Guangxi',
        'Guizhou', 'Hainan', 'Hebei', 'Heilongjiang', 'Henan', 'Hong Kong',
        'Hubei', 'Hunan', 'Inner Mongolia', 'Jiangsu', 'Jiangxi', 'Jilin',
        'Liaoning', 'Macau', 'Ningxia', 'Qinghai', 'Shaanxi', 'Shandong',
        'Shanghai', 'Shanxi', 'Sichuan', 'Tianjin', 'Tibet', 'Xinjiang',
        'Yunnan', 'Zhejiang', 'Faroe Islands', 'Greenland',
        'French Guiana', 'French Polynesia', 'Guadeloupe', 'Martinique',
        'Mayotte', 'New Caledonia', 'Reunion', 'Saint Barthelemy',
        'Saint 

As noted by David in his work, the added weather features include the following:

- ```temp```: Mean temperature for the day in degrees Fahrenheit to tenths.
- ```max```: Maximum temperature reported during the day in Fahrenheit to tenths--time of max temp report varies by country and region, so this will sometimes not be the max for the calendar day.
- ```min```: Minimum temperature reported during the day in Fahrenheit to tenths--time of min temp report varies by country and region, so this will sometimes not be the min for the calendar day.
- ```stp```: Mean station pressure for the day in millibars to tenths.
- ```slp```: Mean sea level pressure for the day in millibars to tenths.
- ```dewp```: Mean dew point for the day in degrees Fahrenheit to tenths.
- ```wdsp```: Mean wind speed for the day in knots to tenths.
- ```prcp```: Total precipitation (rain and/or melted snow) reported during the day in inches and hundredths; will usually not end with the midnight observation--i.e., may include latter part of previous day. .00 indicates no measurable precipitation (includes a trace).
- ```fog```: Indicator (1 = yes, 0 = no/not reported) for the occurrence of fog during the day.
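
Since these readings are in imperial units, and some values in the table above look like placeholders rather than measurements (`stp` is 999.9 in every row shown, and one `prcp` reading is 99.99), a small clean-up step may be useful before modelling. The sketch below is plain Python so it runs without the data files. The helper names and the two placeholder values in `ASSUMED_MISSING` are assumptions for illustration and should be checked against the GSOD documentation.

```python
import math

# Readings that appear to stand for "missing" in the table above.
# Both values are assumptions, not confirmed GSOD conventions.
ASSUMED_MISSING = {"stp": 999.9, "prcp": 99.99}


def clean_reading(name, value):
    """Return NaN when the value matches an assumed placeholder, else the value."""
    sentinel = ASSUMED_MISSING.get(name)
    if sentinel is not None and math.isclose(value, sentinel):
        return math.nan
    return value


def fahrenheit_to_celsius(deg_f):
    return (deg_f - 32.0) * 5.0 / 9.0


def knots_to_metres_per_second(knots):
    return knots * 1852.0 / 3600.0  # one knot is one nautical mile (1852 m) per hour


def inches_to_millimetres(inches):
    return inches * 25.4


# Second Afghanistan row (2020-01-23) from the table above.
row = {"min": 32.7, "max": 55.9, "stp": 999.9, "wdsp": 14.9, "prcp": 99.99}

converted = {
    "min_c": fahrenheit_to_celsius(clean_reading("min", row["min"])),
    "max_c": fahrenheit_to_celsius(clean_reading("max", row["max"])),
    "stp_mbar": clean_reading("stp", row["stp"]),
    "wdsp_ms": knots_to_metres_per_second(clean_reading("wdsp", row["wdsp"])),
    "prcp_mm": inches_to_millimetres(clean_reading("prcp", row["prcp"])),
}
print(converted)
```

Placeholder readings become NaN, so they can later be dropped or filled in explicitly instead of being treated as real measurements.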

The reason for including weather data in COVID-19 prediction is that previous research (an example paper is [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2916580/)) links the coronavirus family to a [seasonal pattern](https://www.bbc.com/future/article/20200323-coronavirus-will-hot-weather-kill-covid-19), with indications that warmer weather could [slow down](https://www.theguardian.com/world/2020/apr/05/scientists-ask-could-summer-heat-help-beat-covid-19) the transmission of the virus. However, health experts have also cautioned that this might not be [true](https://www.sciencenews.org/article/coronavirus-warm-weather-will-not-slow-covid-19-transmission).

As such, we'll use the InterpretML toolbox to investigate the relationship between each of these weather features and the COVID-19 forecast.
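
Before fitting any model, a quick model-free check of how a weather column moves together with case growth is the Pearson correlation coefficient. The sketch below computes it in plain Python. The two lists are made-up numbers, not readings from the dataset, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


# Made-up example: mean temperature (F) and daily growth rate of confirmed cases.
temps = [40.0, 45.0, 50.0, 55.0, 60.0]
growth = [0.30, 0.27, 0.24, 0.21, 0.18]
r = pearson_r(temps, growth)
print(r)
```

A coefficient like this only describes a linear association. It does not show that weather drives transmission, which is why the per-feature explanations from the trained models are still needed.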

### 1.b. Population Data 
Specifically, the population density of each region. In theory, a region with a higher population density should have a higher chance of faster COVID-19 transmission. For consistency, we mainly use the population and population density data recorded by [Worldometer](https://www.worldometers.info/world-population/population-by-country/) on the respective country pages.


In [23]:
population_df = pd.read_csv("../input/Worldometer_Population_Regional_Latest.csv")
population_df

Unnamed: 0.1,Unnamed: 0,#,Country (or dependency),Region,Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
0,0,1.0,China,All_Regions,1439323776,0.39 %,5540090,153,9388211,-348399,1.7,38,61 %,18.47 %
1,1,1.5,China,Nanchang,2357839,,,,,,,,,
2,1,1.5,China,Xi'an,6501190,,,,,,,,,
3,1,1.5,China,Guangzhou,11071424,,,,,,,,,
4,1,1.5,China,Lijiang,1137600,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7136,99,99.5,Serbia,Trstenik,49043,,,,,,,,,
7137,99,99.5,Serbia,Jagodina,35589,,,,,,,,,
7138,99,99.5,Serbia,Sremska Mitrovica,39084,,,,,,,,,
7139,99,99.5,Serbia,Pancevo,76654,,,,,,,,,


In [24]:
#population_df[population_df['Country (or dependency)'] == 'China']
#population_df[population_df['Country (or dependency)'] == 'China']['Region'] == 'All_Regions'
population_df_country = population_df[population_df['Country (or dependency)'] == 'China']
int(population_df_country[population_df_country['Region'] == "All_Regions"]['Density (P/Km²)'].values[0])
#population_df[population_df[population_df['Country (or dependency)'] == 'China']['Region'] == 'All_Regions']
#population_df_country = population_df[population_df['Country (or dependency)'] == country]['Region'] == 'All_Regions'

153
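
As a cross-check, the density value returned above can be reproduced from the `Population (2020)` and `Land Area (Km²)` columns of the same China row, since density is simply population divided by land area. This is a minimal sketch; `population_density` is a hypothetical helper, and the two inputs are copied from the table above.

```python
def population_density(population, land_area_km2):
    """People per square kilometre."""
    if land_area_km2 <= 0:
        raise ValueError("land area must be positive")
    return population / land_area_km2


# China, 'All_Regions' row of the Worldometer table above.
china_density = population_density(1_439_323_776, 9_388_211)
print(round(china_density))
```

The same calculation could fill in a density for any region where both a population and an area are available.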

In [25]:
#list the unique regions in the population_df DataFrame, while also removing the 'All_Regions' tag (which indicate it's the population of the whole country, and not just a region)
popdf_unique_regions = population_df['Region'].unique()
popdf_unique_regions = np.sort(popdf_unique_regions[popdf_unique_regions != 'All_Regions'])
print("All {} unique regions recorded:".format(str(len(popdf_unique_regions))))
print(popdf_unique_regions, "True Victoria" in popdf_unique_regions)

All 6789 unique regions recorded:
["'Afak" "'Ajlun" "'Ali Sabieh" ... '`Izra' 'eMbalenhle' 'maalot Tarshiha'] False


(Note that 'All_Regions' means that the data in that row applies to the whole country, not just a particular region of that country.)

However, as can be seen, Worldometer doesn't seem to provide population density at the regional level. Hence, for countries with regions in the training data, we'll instead use the 2019 population data provided by the OECD on their [Region and Cities](https://stats.oecd.org/Index.aspx?DataSetCode=REGION_DEMOGR#) page. These figures are unlikely to reflect the latest population density for every region/province in the training data. However, we believe the change in population density between 2019 and 2020 should be small enough to have little effect. Suggestions for a more reliable source would be welcome as feedback.

Note that the OECD divides regions into 2 types: TL2 (Large) and TL3 (Small) regions. Let's first take a glimpse at the population density records for all region types in 2019.

In [26]:
population_density_area_df = pd.read_csv("../input/OECD_PopulationDensity_and_Area-T2_T3_Regions-2018_2019.csv")
print("Columns available:", population_density_area_df.columns)
population_density_area_df.head()

Columns available: Index(['TL', 'Territory Level and Typology', 'REG_ID', 'Region', 'VAR',
       'Indicator', 'SEX', 'Gender', 'POS', 'Position', 'TIME', 'Year',
       'Unit Code', 'Unit', 'PowerCode Code', 'PowerCode',
       'Reference Period Code', 'Reference Period', 'Value', 'Flag Codes',
       'Flags'],
      dtype='object')


Unnamed: 0,TL,Territory Level and Typology,REG_ID,Region,VAR,Indicator,SEX,Gender,POS,Position,...,Year,Unit Code,Unit,PowerCode Code,PowerCode,Reference Period Code,Reference Period,Value,Flag Codes,Flags
0,1,Country,AUT,Austria,POP_DEN,Population density (pop. per km2),T,Total,ALL,All regions,...,2018,RATIO,Ratio,0,Units,,,106.91,,
1,2,Large regions (TL2),AT21,Carinthia,POP_DEN,Population density (pop. per km2),T,Total,ALL,All regions,...,2018,RATIO,Ratio,0,Units,,,59.88,,
2,1,Country,BEL,Belgium,SURF,Regional surface,T,Total,ALL,All regions,...,2018,KM2,Square kilometres,0,Units,,,30451.0,,
3,2,Large regions (TL2),BE1,Brussels Capital Region,POP_DEN,Population density (pop. per km2),T,Total,ALL,All regions,...,2018,RATIO,Ratio,0,Units,,,7441.31,,
4,2,Large regions (TL2),DED,Saxony,POP_DEN,Population density (pop. per km2),T,Total,ALL,All regions,...,2018,RATIO,Ratio,0,Units,,,224.54,,


In [27]:
#Limit to only population density data, and in Year 2019 only
population_density_only = population_density_area_df[population_density_area_df["VAR"] == "POP_DEN"]
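#NOTE: drop() returns a new DataFrame; the result is not assigned here, so these columns are still present in the output below
population_density_only.drop(['SEX', 'Gender', 'POS', 'Position', 'PowerCode Code', 'Reference Period Code', 'Reference Period'], axis=1)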
population_density_2019 = population_density_only[population_density_only["Year"] == 2019]
population_density_unique_regions = population_density_2019['Region'].unique()
print("All unique {} regions recorded for OECD's population density data: ".format(str(len(population_density_unique_regions))), 
                                                                                    population_density_unique_regions)
population_density_2019.head()

All unique 2938 regions recorded for OECD's population density data:  ['Guerrero, R4' 'Jalisco' 'Mexico, R2' ... 'Altai Krai' 'Sud-Ouest'
 'Chelyabinsk Oblast']


Unnamed: 0,TL,Territory Level and Typology,REG_ID,Region,VAR,Indicator,SEX,Gender,POS,Position,...,Year,Unit Code,Unit,PowerCode Code,PowerCode,Reference Period Code,Reference Period,Value,Flag Codes,Flags
1099,3,Small regions (TL3),ME12R4,"Guerrero, R4",POP_DEN,Population density (pop. per km2),T,Total,ALL,All regions,...,2019,RATIO,Ratio,0,Units,,,52.31,,
1103,2,Large regions (TL2),ME14,Jalisco,POP_DEN,Population density (pop. per km2),T,Total,ALL,All regions,...,2019,RATIO,Ratio,0,Units,,,105.33,,
1105,3,Small regions (TL3),ME15R2,"Mexico, R2",POP_DEN,Population density (pop. per km2),T,Total,ALL,All regions,...,2019,RATIO,Ratio,0,Units,,,1877.13,,
1121,3,Small regions (TL3),ME17R5,"Morelos, R5",POP_DEN,Population density (pop. per km2),T,Total,ALL,All regions,...,2019,RATIO,Ratio,0,Units,,,142.19,,
1135,3,Small regions (TL3),ME19R3,"Nuevo Leon, R3",POP_DEN,Population density (pop. per km2),T,Total,ALL,All regions,...,2019,RATIO,Ratio,0,Units,,,2.95,,


In [28]:
segment = population_density_2019[population_density_2019['Region'] == 'Anhui']
segment['Value'].values[0]
#segment[segment['Territory Level and Typology'] == 'Country']['Value'].values
#segment.loc[]['Value']
#len(segment[segment['Territory Level and Typology'] == 'Country'])

451.31
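
Taking `values[0]` as above silently picks the first row if a region name occurs more than once, which matters because region names are not guaranteed to be unique across countries. A stricter lookup only accepts a value when exactly one row matches. The sketch below shows the idea on plain Python records. `lookup_unique` is a hypothetical helper, and the two 'Victoria' rows are placeholders, not real figures.

```python
import math


def lookup_unique(rows, region, value_key="Value"):
    """Return the value for `region` only when exactly one row matches."""
    matches = [row for row in rows if row["Region"] == region]
    if len(matches) != 1:
        return math.nan  # region is missing or its name is ambiguous
    return matches[0][value_key]


rows = [
    {"Region": "Anhui", "Value": 451.31},    # value returned by the cell above
    {"Region": "Jalisco", "Value": 105.33},  # value from the OECD table above
    {"Region": "Victoria", "Value": 1.0},    # placeholder, not a real figure
    {"Region": "Victoria", "Value": 2.0},    # placeholder duplicate name
]

print(lookup_unique(rows, "Anhui"))
print(lookup_unique(rows, "Victoria"))
```

Returning NaN for ambiguous names keeps a wrong density from being attached to the wrong region.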

In [29]:
country_segment = train_appended_df[train_appended_df['Country_Region'] == "Mexico"]
list(country_segment['Province_State'].unique()) == [np.NaN]

True

In [30]:
country_segment = train_appended_df[train_appended_df['Country_Region'] == 'China']
region_segment = country_segment[country_segment['Province_State'] == 'Anhui']
country_segment

Adding these population data into the modified training_data:

In [31]:
popdf_unique_regions

array(["'Afak", "'Ajlun", "'Ali Sabieh", ..., '`Izra', 'eMbalenhle',
       'maalot Tarshiha'], dtype=object)

In [32]:
#train_appended_df = train_appended_df.copy()
#Initiate new feature columns
added_features = ['Population (2020)', 'Population Density']
for feature in added_features:
    train_appended_df[feature] = 0

for country in train_appended_df['Country_Region'].unique():
    #print(train_appended_df['Population (2020)'].unique())
    country_segment = train_appended_df[train_appended_df['Country_Region'] == country]

    #Sanity check for several countries, as they're apparently named quite differently in Worldometers vs the training data
    if country == "Burma":
        country = "Myanmar" #Burma in Training data is actually Myanmar. History stuff I guess?
    elif country == "Korea, South":
        country = "South Korea" #this one is honestly just trolling at this point...

    population_df_country = population_df[population_df['Country (or dependency)'] == country]#['Region'] == 'All_Regions'

    #check whether the current country has any listed states/regions in the original training data.
    #If yes: Add regional population and regional population density data
    #If not: Only add country population and population density data.
    country_regions_training = list(country_segment['Province_State'].unique())

    #Apparently there are 2 'Congo'-s: Republic of Congo/Brazzaville, vs DEMOCRATIC Republic of Congo/Zaire (as how it's differentiated in Worldometers)
    if country == "Congo (Brazzaville)": 
        country = "Congo"
        country_regions_training = ["Brazzaville"]
    elif country == "Congo (Kinshasa)":
        country = "Congo"
        country_regions_training = ["Kinshasa"]

    if country_regions_training == [np.NaN]:
        #print(country, country_regions_training)
        
        #sanity check: in case country isn't listed in the worldometers population data, 
        #then query to the population_df would return DataFrame of 0
        if len(population_df_country) != 0:
            country_row = population_df_country[population_df_country['Region'] == "All_Regions"]
            try:
                country_population = int(country_row["Population (2020)"].values[0].replace(",", ""))
                country_population_density = country_row["Density (P/Km²)"].values[0]
            except (IndexError, ValueError, AttributeError):
                #skip this country rather than reuse values left over from the previous iteration
                print("Problematic country for population data", country, country_regions_training)
                continue
        else:
            continue
            
        country_ids = country_segment.index.tolist()
        train_appended_df.loc[country_ids, ['Population (2020)']] = country_population
        train_appended_df.loc[country_ids, ['Population Density']] = country_population_density
    else:
        for region in country_regions_training:
            region_segment = country_segment[country_segment['Province_State'] == region]

            #sanity check, as apparently the region names are not truly unique to a country in Worldometer's data
            #(in particular, the region 'Victoria' which is unique to Australia in training data, is not present in Worldometer's Australia,
            # and instead available for other countries.)
            region_popdf_segment = population_df_country[population_df_country['Region'] == region]
            region_popdensity_segment = population_density_2019[population_density_2019['Region'] == region]

            if len(region_popdf_segment) == 1: #Means that there is a valid row available in Worldometer's population_data
                region_population = int(region_popdf_segment["Population (2020)"].values[0].replace(",", "") )
                #region_population = int(population_df_country[population_df_country['Region'] == region]["Population (2020)"].values[0].replace(",", ""))
            else:
                region_population = np.NaN

            if len(region_popdensity_segment) == 1:
                region_population_density = population_density_2019[population_density_2019['Region'] == region]['Value'].values[0]
            else:
                region_population_density = np.NaN

            region_ids = region_segment.index.tolist()
            train_appended_df.loc[region_ids, ['Population (2020)']] = region_population
            train_appended_df.loc[region_ids,['Population Density']] = region_population_density
train_appended_df
#print("Done")

Unnamed: 0,Id,Country_Region,Province_State,Date,ConfirmedCases,Fatalities,country+province,Lat,Long,day_from_jan_first,...,stp,slp,dewp,rh,ah,wdsp,prcp,fog,Population (2020),Population Density
0,1,Afghanistan,,2020-01-22,0.0,0.0,Afghanistan-,33.000000,65.000000,22,...,999.9,1024.3,27.4,0.545709,0.186448,9.4,0.00,0,38928346.0,60
1,2,Afghanistan,,2020-01-23,0.0,0.0,Afghanistan-,33.000000,65.000000,23,...,999.9,1020.8,22.8,0.461259,0.163225,14.9,99.99,1,38928346.0,60
2,3,Afghanistan,,2020-01-24,0.0,0.0,Afghanistan-,33.000000,65.000000,24,...,999.9,1018.6,34.5,0.801794,0.325375,10.4,0.17,1,38928346.0,60
3,4,Afghanistan,,2020-01-25,0.0,0.0,Afghanistan-,33.000000,65.000000,25,...,999.9,1018.0,37.8,0.728175,0.214562,6.1,0.57,1,38928346.0,60
4,5,Afghanistan,,2020-01-26,0.0,0.0,Afghanistan-,33.000000,65.000000,26,...,999.9,1014.8,33.2,0.685513,0.231656,10.8,0.00,1,38928346.0,60
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24409,35642,Zimbabwe,,2020-04-04,9.0,1.0,Zimbabwe-,-17.829167,31.052222,95,...,999.9,,53.9,0.481730,0.130122,4.2,0.00,0,14862924.0,38
24410,35643,Zimbabwe,,2020-04-05,9.0,1.0,Zimbabwe-,-17.829167,31.052222,96,...,999.9,,53.9,0.481730,0.130122,4.2,0.00,0,14862924.0,38
24411,35644,Zimbabwe,,2020-04-06,10.0,1.0,Zimbabwe-,-17.829167,31.052222,97,...,999.9,,53.9,0.481730,0.130122,4.2,0.00,0,14862924.0,38
24412,35645,Zimbabwe,,2020-04-07,11.0,2.0,Zimbabwe-,-17.829167,31.052222,98,...,999.9,,53.9,0.481730,0.130122,4.2,0.00,0,14862924.0,38
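
As a possible alternative to the nested loop for the country-level rows, the same join can be sketched as a rename followed by a left merge. The example uses tiny stand-in tables, not the real files. The Afghanistan and Zimbabwe populations are copied from the output above, the Myanmar figure is a placeholder, 'Atlantis' is a made-up country that shows how unmatched names surface, and `NAME_FIXES` and `lookup_name` are hypothetical names.

```python
import pandas as pd

# Renames taken from the loop above (training-data name -> Worldometer name).
NAME_FIXES = {"Burma": "Myanmar", "Korea, South": "South Korea"}

train = pd.DataFrame(
    {"Country_Region": ["Afghanistan", "Burma", "Zimbabwe", "Atlantis"]}
)
population = pd.DataFrame({
    "Country (or dependency)": ["Afghanistan", "Myanmar", "Zimbabwe"],
    "Region": ["All_Regions"] * 3,
    # "1,000" is a placeholder for Myanmar, not a real figure.
    "Population (2020)": ["38,928,346", "1,000", "14,862,924"],
})

# Keep only whole-country rows and turn "38,928,346" into 38928346.
country_rows = population[population["Region"] == "All_Regions"].copy()
country_rows["Population (2020)"] = (
    country_rows["Population (2020)"].str.replace(",", "", regex=False).astype(int)
)
country_rows = country_rows.rename(
    columns={"Country (or dependency)": "lookup_name"}
)

# Apply the renames, then attach the population with a left join.
train["lookup_name"] = train["Country_Region"].replace(NAME_FIXES)
merged = train.merge(
    country_rows[["lookup_name", "Population (2020)"]],
    on="lookup_name",
    how="left",
    indicator=True,
)

# Countries with no match keep NaN and are listed for manual review.
unmatched = merged.loc[merged["_merge"] == "left_only", "Country_Region"].tolist()
print(unmatched)
```

The `indicator=True` column makes naming mismatches visible in one place, instead of relying on printed messages inside a loop.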


In [17]:
train_appended_df.to_csv("../input/train_feature_appended.csv")

In [None]:
for country in train_appended_df['Country_Region']:
    train_appended_df['Population (2020)'] = 



### 1. Default Features
First, we'll create a model that is trained just using the default features provided by the training data. 
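
One way the default columns could be turned into model inputs is to give each row the previously known counts for its own region as lagged features. The sketch below is an assumption about how that might look, not the feature set actually used. The sample values are made up, and `add_lag_features` is a hypothetical helper.

```python
import pandas as pd

# Made-up sample in the shape of train.csv (values are illustrative only).
sample = pd.DataFrame({
    "Country_Region": ["A"] * 4 + ["B"] * 4,
    "Date": pd.to_datetime(
        ["2020-04-01", "2020-04-02", "2020-04-03", "2020-04-04"] * 2
    ),
    "ConfirmedCases": [1.0, 3.0, 6.0, 10.0, 0.0, 0.0, 2.0, 5.0],
    "Fatalities": [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})


def add_lag_features(df, lags=(1, 2)):
    """Add per-country lagged counts and day-over-day increases."""
    out = df.sort_values(["Country_Region", "Date"]).copy()
    for col in ("ConfirmedCases", "Fatalities"):
        for lag in lags:
            # shift() within each country so values never leak across countries
            out[f"{col}_lag{lag}"] = out.groupby("Country_Region")[col].shift(lag)
        out[f"{col}_new"] = out.groupby("Country_Region")[col].diff()
    return out


features = add_lag_features(sample)
print(features[["Country_Region", "Date", "ConfirmedCases", "ConfirmedCases_lag1"]])
```

The first rows of each country have no history, so their lagged values are NaN and need to be dropped or filled before training.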

### 1.c. Testing Data
Testing is a very important step in detecting and (hopefully) reducing the spread of COVID-19. In particular, the development of *rapid* and *accurate* tests that are *highly accessible* to the public has been [touted](https://www.heart.org/en/news/2020/04/02/covid-19-science-why-testing-is-so-important) by [a lot](https://www.id-hub.com/2020/04/02/the-importance-of-diagnostic-testing-for-covid-19/) of [experts](https://www.weforum.org/agenda/2020/04/united-states-coronavirus-bill-gates/) as one of the critical steps a country should focus on to combat the spread of COVID-19.

As such, we suspect a strong relationship between the number of tests (as well as test accuracy) and how well a model can forecast COVID-19, and decided to also include the testing data provided by a variety of countries. Of course, it could be argued that more tests would logically also increase the number of *detected* confirmed cases. However, faster testing should in theory lead to fewer fatalities, since confirmed patients can be handled faster and better.

For this, we'll be 