# COVID19 Global Forecasting (Week 5)
Forecast the daily spread of COVID-19 in regions around the world

------------------------------------------------------------------------------------------------------------------

# Workflow 

- Assignment description
- Import the test and training sets
- Data preprocessing & structural changes

------------------------------------------------------------------------------------------------------------------

## Assignment description

#### Background
The White House Office of Science and Technology Policy (OSTP) pulled together a coalition of research groups and companies (including Kaggle) to prepare the COVID-19 Open Research Dataset (CORD-19) to attempt to address key open scientific questions on COVID-19. Those questions are drawn from the National Academies of Sciences, Engineering, and Medicine (NASEM) and the World Health Organization (WHO).

#### The Challenge
Kaggle is launching companion COVID-19 forecasting challenges to help answer a subset of the NASEM/WHO questions. While the challenge involves developing quantile estimates for confirmed cases and fatalities between May 12 and June 7 by region, the primary goal isn't only to produce accurate forecasts. It’s also to identify factors that appear to impact the transmission rate of COVID-19.

You are encouraged to pull in, curate and share data sources that might be helpful. If you find variables that look like they impact the transmission rate, please share your findings in a notebook.

As the data becomes available, we will update the leaderboard with live results based on data made available from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE).

We have received support and guidance from health and policy organizations in launching these challenges. We're hopeful the Kaggle community can make valuable contributions to developing a better understanding of factors that impact the transmission of COVID-19.
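
The challenge asks for quantile estimates rather than single point forecasts. A common way to score a quantile estimate is the pinball (quantile) loss. The text above does not name the scoring metric, so the sketch below is an assumption for illustration only; the function name `pinball_loss` and the sample numbers are hypothetical.

```python
import numpy as np

def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball loss of predictions `y_pred` for a quantile in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs quantile * diff,
    # over-prediction (diff < 0) costs (1 - quantile) * |diff|.
    return float(np.mean(np.maximum(quantile * diff, (quantile - 1) * diff)))

# Hypothetical observed counts and a constant median forecast of 12.
observed = [10.0, 14.0, 20.0]
median_forecast = [12.0, 12.0, 12.0]
loss_q50 = pinball_loss(observed, median_forecast, 0.5)
```

For the median (quantile 0.5) the loss is half the mean absolute error; for higher quantiles, forecasts that fall below the observed value are penalised more heavily than forecasts that fall above it.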

------------------------------------------------------------------------------------------------------------------

## Import the test and training sets

In [1]:
# !pip install pandas

In [2]:
# !pip install seaborn

In [3]:
# !pip install scikit-learn

In [4]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

In [5]:
# data import
train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')

In [6]:
# show columns
print(train_df.columns.values)

['Id' 'County' 'Province_State' 'Country_Region' 'Population' 'Weight'
 'Date' 'Target' 'TargetValue']


In [7]:
# null-values
train_df.info()
print('_'*40)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 810342 entries, 0 to 810341
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Id              810342 non-null  int64  
 1   County          735462 non-null  object 
 2   Province_State  766584 non-null  object 
 3   Country_Region  810342 non-null  object 
 4   Population      810342 non-null  int64  
 5   Weight          810342 non-null  float64
 6   Date            810342 non-null  object 
 7   Target          810342 non-null  object 
 8   TargetValue     810342 non-null  float64
dtypes: float64(2), int64(2), object(5)
memory usage: 55.6+ MB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311670 entries, 0 to 311669
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   ForecastId      311670 non-null  int64  
 1   County          282870 non-
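
The `info()` output lists `Date` as `object`, meaning the dates are read as plain strings. A minimal sketch of parsing them at load time with the `parse_dates` argument of `pd.read_csv`, and of counting missing values per column; the in-memory `sample_csv` is a small stand-in for `./data/train.csv`, not the real file.

```python
import io
import pandas as pd

# Two sample rows in the same column layout as train.csv.
sample_csv = io.StringIO(
    "Id,County,Province_State,Country_Region,Population,Weight,Date,Target,TargetValue\n"
    "1,,,Afghanistan,27657145,0.058359,2020-01-23,ConfirmedCases,0.0\n"
    "2,,,Afghanistan,27657145,0.583587,2020-01-23,Fatalities,0.0\n"
)

# parse_dates converts the Date column to a datetime dtype while reading.
sample_df = pd.read_csv(sample_csv, parse_dates=["Date"])
date_dtype = sample_df["Date"].dtype

# Count missing values per column (empty CSV fields are read as NaN).
missing_per_column = sample_df.isna().sum()
```

With a datetime column, date arithmetic and sorting work on real dates instead of on strings.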

In [8]:
train_df.describe()

Unnamed: 0,Id,Population,Weight,TargetValue
count,810342.0,810342.0,810342.0,810342.0
mean,484797.5,2720127.0,0.53087,10.466348
std,279911.126702,34777710.0,0.451909,269.202775
min,1.0,86.0,0.047491,-10034.0
25%,242376.25,12133.0,0.096838,0.0
50%,484797.5,30531.0,0.349413,0.0
75%,727218.75,105612.0,0.968379,0.0
max,969594.0,1395773000.0,2.239186,36163.0
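
The summary shows a minimum `TargetValue` of -10034, so some rows hold negative values. A minimal sketch for isolating such rows and counting them per target; the frame `demo` and its numbers are hypothetical stand-ins for `train_df`.

```python
import pandas as pd

# Hypothetical stand-in for train_df with one negative value.
demo = pd.DataFrame({
    "Target": ["ConfirmedCases", "Fatalities", "ConfirmedCases"],
    "TargetValue": [14.0, -3.0, 0.0],
})

# Boolean mask selects the rows with a negative TargetValue.
negative_rows = demo[demo["TargetValue"] < 0]

# Number of negative rows per target type.
negative_counts = negative_rows.groupby("Target").size()
```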


In [9]:
# preview the data
train_df.head(2000)

Unnamed: 0,Id,County,Province_State,Country_Region,Population,Weight,Date,Target,TargetValue
0,1,,,Afghanistan,27657145,0.058359,2020-01-23,ConfirmedCases,0.0
1,2,,,Afghanistan,27657145,0.583587,2020-01-23,Fatalities,0.0
2,3,,,Afghanistan,27657145,0.058359,2020-01-24,ConfirmedCases,0.0
3,4,,,Afghanistan,27657145,0.583587,2020-01-24,Fatalities,0.0
4,5,,,Afghanistan,27657145,0.058359,2020-01-25,ConfirmedCases,0.0
...,...,...,...,...,...,...,...,...,...
1995,2364,,Australian Capital Territory,Australia,426709,0.771375,2020-03-24,Fatalities,0.0
1996,2365,,Australian Capital Territory,Australia,426709,0.077138,2020-03-25,ConfirmedCases,0.0
1997,2366,,Australian Capital Territory,Australia,426709,0.771375,2020-03-25,Fatalities,0.0
1998,2367,,Australian Capital Territory,Australia,426709,0.077138,2020-03-26,ConfirmedCases,14.0


-----------------------------------------------------------------------------------------------------------------

### Observations

- The first column can be ignored because it only provides a unique identifier for each row.


- The columns ['County', 'Province_State', 'Country_Region'] together give the location. Because these locations have no relation to each other in this dataset, we can combine them into a single location_id.


- Target takes one of two values: ['ConfirmedCases', 'Fatalities'].


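- TargetValue holds the value reported for the row's Target: the number of 'ConfirmedCases' or 'Fatalities'. It is a count, not a percentage of the population; the summary above shows values ranging from -10034 to 36163. Weight is the column that differs per location and per target (in the preview, the 'Fatalities' weight is ten times the 'ConfirmedCases' weight).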
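
Two of these observations can be checked directly: which values `Target` takes, and how `Weight` relates across the two targets. The sketch below uses a handful of rows copied from the preview of `train_df`; the frame name `rows` is hypothetical.

```python
import pandas as pd

# Rows copied from the train_df preview.
rows = pd.DataFrame({
    "Country_Region": ["Afghanistan", "Afghanistan", "Australia", "Australia"],
    "Target": ["ConfirmedCases", "Fatalities", "ConfirmedCases", "Fatalities"],
    "Weight": [0.058359, 0.583587, 0.077138, 0.771375],
})

# Distinct values of the Target column.
targets = sorted(rows["Target"].unique())

# One row per country, one Weight column per target.
weights = rows.pivot_table(index="Country_Region", columns="Target", values="Weight")

# Ratio of the Fatalities weight to the ConfirmedCases weight.
weight_ratio = weights["Fatalities"] / weights["ConfirmedCases"]
```

On these rows the ratio comes out at roughly 10 for both locations; whether that holds for the full training set would need to be checked on `train_df` itself.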



-----------------------------------------------------------------------------------------------------------------

In [10]:
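# only County and Province_State contain nulls; fill them with empty strings
train_df[['County', 'Province_State']] = train_df[['County', 'Province_State']].fillna('')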

In [11]:
# preview the data
train_df.head()

Unnamed: 0,Id,County,Province_State,Country_Region,Population,Weight,Date,Target,TargetValue
0,1,,,Afghanistan,27657145,0.058359,2020-01-23,ConfirmedCases,0.0
1,2,,,Afghanistan,27657145,0.583587,2020-01-23,Fatalities,0.0
2,3,,,Afghanistan,27657145,0.058359,2020-01-24,ConfirmedCases,0.0
3,4,,,Afghanistan,27657145,0.583587,2020-01-24,Fatalities,0.0
4,5,,,Afghanistan,27657145,0.058359,2020-01-25,ConfirmedCases,0.0


In [12]:
train_df['location_id'] = train_df['Country_Region'] + train_df['Province_State'] + train_df['County']

In [13]:
train_df = train_df.drop(['Province_State','County','Country_Region','Id', 'Population'], axis=1)

In [14]:
# preview the data
train_df.head(1900)

Unnamed: 0,Weight,Date,Target,TargetValue,location_id
0,0.058359,2020-01-23,ConfirmedCases,0.0,Afghanistan
1,0.583587,2020-01-23,Fatalities,0.0,Afghanistan
2,0.058359,2020-01-24,ConfirmedCases,0.0,Afghanistan
3,0.583587,2020-01-24,Fatalities,0.0,Afghanistan
4,0.058359,2020-01-25,ConfirmedCases,0.0,Afghanistan
...,...,...,...,...,...
1895,0.771375,2020-02-03,Fatalities,0.0,AustraliaAustralian Capital Territory
1896,0.077138,2020-02-04,ConfirmedCases,0.0,AustraliaAustralian Capital Territory
1897,0.771375,2020-02-04,Fatalities,0.0,AustraliaAustralian Capital Territory
1898,0.077138,2020-02-05,ConfirmedCases,0.0,AustraliaAustralian Capital Territory
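
The concatenated key runs its parts together, as in 'AustraliaAustralian Capital Territory'. One option, sketched here as an assumption and not as the notebook's method, is to join the parts with a separator so the key stays readable and two different locations cannot collapse into the same string. The helper name `make_location_id` and the `|` separator are hypothetical.

```python
import pandas as pd

def make_location_id(df, sep="|"):
    """Join country, province and county into one key, keeping empty parts."""
    parts = df[["Country_Region", "Province_State", "County"]].fillna("")
    return parts["Country_Region"].str.cat(
        [parts["Province_State"], parts["County"]], sep=sep
    )

# Small stand-in frame with the same location columns as train_df.
locations = pd.DataFrame({
    "Country_Region": ["Afghanistan", "Australia"],
    "Province_State": [None, "Australian Capital Territory"],
    "County": [None, None],
})
location_ids = make_location_id(locations).tolist()
```

Keeping the empty parts makes the key longer, but every key then has the same three fields, so it can be split back into its columns later.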


In [15]:
train_df_cases = train_df.loc[train_df['Target'] == 'ConfirmedCases'].copy()

In [16]:
train_df_fatalities = train_df.loc[train_df['Target'] == 'Fatalities'].copy()

In [17]:
train_df_cases.shape, train_df_fatalities.shape

((405171, 5), (405171, 5))

In [18]:
train_df_cases.set_index(["location_id", "Date"], inplace = True)

In [19]:
train_df_fatalities.set_index(["location_id", "Date"], inplace = True)

In [20]:
train_df_cases = train_df_cases.add_prefix('cas_')

In [21]:
train_df_fatalities = train_df_fatalities.add_prefix('fat_')

In [22]:
result = pd.concat([train_df_cases, train_df_fatalities], axis=1, sort=False)
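
The split, prefix and concat steps put the cases and fatalities for each (location_id, Date) pair side by side. A more compact sketch of the same reshaping uses `set_index` followed by `unstack`; the frame `long_df` is a small stand-in for `train_df` after the column drop in In [13], and the resulting column names differ from the `cas_`/`fat_` prefixes used in the notebook.

```python
import pandas as pd

# Stand-in for train_df in long format: one row per location, date and target.
long_df = pd.DataFrame({
    "Weight": [0.058359, 0.583587, 0.058359, 0.583587],
    "Date": ["2020-01-23", "2020-01-23", "2020-01-24", "2020-01-24"],
    "Target": ["ConfirmedCases", "Fatalities", "ConfirmedCases", "Fatalities"],
    "TargetValue": [0.0, 0.0, 0.0, 0.0],
    "location_id": ["Afghanistan"] * 4,
})

# Move Target from the row index into the columns: one row per (location_id, Date).
wide_df = long_df.set_index(["location_id", "Date", "Target"]).unstack("Target")

# Flatten the (value, target) column MultiIndex into names like "TargetValue_Fatalities".
wide_df.columns = [f"{value}_{target}" for value, target in wide_df.columns]
```

`unstack` requires each (location_id, Date, Target) combination to be unique; if duplicates exist in the real data it raises an error, which doubles as a sanity check.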

In [23]:
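# In [13] already removed these columns from train_df, so dropping them again raised
# KeyError: "['Province_State' 'County' 'Country_Region' 'Id' 'Population'] not found in axis".
# Assumption: the intent was to drop the now-redundant Target columns from result.
result = result.drop(['cas_Target', 'fat_Target'], axis=1)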