## Introduction
Kaggle is launching a COVID-19 forecasting challenge to help answer a subset of the questions on COVID-19 posed by the National Academies of Sciences, Engineering, and Medicine (NASEM) and the World Health Organization (WHO). In this challenge, we predict the daily number of confirmed COVID-19 cases in various locations across the world, as well as the number of resulting fatalities, for future dates. In this notebook, we use the COVID19 Global Forecasting (Week 5) dataset, including the train, test and submission CSV files. First, we perform data analysis to identify the factors that impact the transmission rate of COVID-19. Afterwards, we analyze the effect of COVID-19 in India. Finally, we use XGBoost and Random Forest regressors as ensemble learning models to predict the daily number of confirmed COVID-19 cases and the number of resulting fatalities in various locations across the world.

## Modeling Goal
I decided early on not to approach this the way I usually build forecasting models. The reason is that we are modeling a physical phenomenon in which people get infected, then infect others for some time, then either recover or die. I therefore studied the compartmental models used in epidemiology, such as SIR. These models rely on two time series: cases and recoveries/deaths. If we have accurate values for both, we can fit these models and get reasonably accurate predictions.
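To make the compartmental idea concrete, here is a minimal sketch of a discrete-time SIR simulation. The function name `simulate_sir`, the daily time step, and the values of `beta` and `gamma` are assumptions chosen for illustration; none of them are fitted to the competition data.

```python
import numpy as np

def simulate_sir(beta, gamma, s0, i0, r0, days):
    """Discrete-time (one step per day) SIR simulation.

    beta  : assumed transmission rate per day
    gamma : assumed removal (recovery or death) rate per day
    Returns three arrays S, I, R of length days + 1.
    """
    n = s0 + i0 + r0  # total population, constant over time
    s = np.empty(days + 1)
    i = np.empty(days + 1)
    r = np.empty(days + 1)
    s[0], i[0], r[0] = s0, i0, r0
    for t in range(days):
        new_infections = beta * s[t] * i[t] / n
        new_removals = gamma * i[t]
        s[t + 1] = s[t] - new_infections
        i[t + 1] = i[t] + new_infections - new_removals
        r[t + 1] = r[t] + new_removals
    return s, i, r

# Illustrative parameters only (hypothetical population of 10,000)
s, i, r = simulate_sir(beta=0.3, gamma=0.1, s0=9990, i0=10, r0=0, days=160)
print(f"peak infected: {i.max():.0f} on day {i.argmax()}")
```

Fitting such a model would mean choosing `beta` and `gamma` so that the simulated series match the observed cases and recoveries/deaths.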

The issue is that we don't have these series.

For cases we have a proxy, confirmed cases. It is only a proxy, for several reasons:

* It depends on the testing policy of each geography. Some test a lot, so their confirmed cases are close to all cases; others test far less.
* A large fraction of infected people are asymptomatic, hence easily missed by testing.
* Testing does not happen when people get infected or become contagious; it often happens with a delay.

For all these reasons, the confirmed case numbers we get are a distorted view of actual cases.

For fatalities the numbers aren't accurate either:

* In some geos we only get deaths of people tested at hospitals; in other geos the count also includes fatalities from nursing homes.
* We don't have recoveries data.

The latter can be fixed by grabbing recovery data from other online sources. Some top competitors did this; I wish I had done it too.

Despite all these caveats, I assumed that some form of SIR model is still at play with the two series we have at hand: fatalities depend on cases detected some time earlier. That led to my first model.
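The assumption that fatalities follow earlier detected cases can be sketched as a lagged regression. This is only an illustration on synthetic data: the helper `fit_lagged_rate`, the 14-day lag and the 2% rate are hypothetical choices, not values estimated from the competition files.

```python
import numpy as np

def fit_lagged_rate(cases, deaths, lag):
    """Least-squares slope (no intercept) of deaths[t] on cases[t - lag]."""
    x = np.asarray(cases[:-lag], dtype=float)   # cases observed `lag` days earlier
    y = np.asarray(deaths[lag:], dtype=float)   # deaths observed today
    return float(x @ y / (x @ x))

# Synthetic series: deaths are exactly 2% of the cases reported 14 days before
rng = np.random.default_rng(0)
cases = rng.integers(50, 500, size=60)
deaths = np.zeros(60)
deaths[14:] = 0.02 * cases[:-14]

rate = fit_lagged_rate(cases, deaths, lag=14)
print(f"estimated fatality rate at lag 14: {rate:.4f}")
```

On real data the lag itself is unknown, so one would typically try several lags and keep the one with the smallest residual.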

![](https://cdn.futura-sciences.com/buildsv6/images/wide1920/2/a/3/2a354825f1_50171817_variant-covid19-coronavirus-epidemie.jpg)

In [None]:
!pip install dataprep

In [None]:
!pip install plotly

# 1. Import libraries and packages

In [None]:
# data manipulation
import pandas as pd
import numpy as np

# data visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot

# default theme
sns.set(context='notebook', style='darkgrid', palette='Spectral', font='sans-serif', font_scale=1, rc=None)
matplotlib.rcParams['figure.figsize'] = [8, 8]
matplotlib.rcParams.update({'font.size': 15})
matplotlib.rcParams['font.family'] = 'sans-serif'

# dataprep library (plot is used for the quick per-column overviews)
from dataprep.eda import plot, create_report
from dataprep.datasets import load_dataset

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn import ensemble
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import metrics

## Load data

In [None]:
df=pd.read_csv('../input/covid19-global-forecasting-week-5/train.csv')
test=pd.read_csv('../input/covid19-global-forecasting-week-5/test.csv')
sub=pd.read_csv('../input/covid19-global-forecasting-week-5/submission.csv')

In [None]:
df

# 2. Data analysis

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.dtypes.value_counts().plot.pie(explode=[0.1,0.1,0.1],autopct='%1.1f%%',shadow=True)
plt.title('data type');

1. Our data has 969640 rows and 9 columns.
2. As the chart shows:
    * more than 55% of the columns are of **object** type
    * 33% are integer
    * 11% are float

In [None]:
df.describe(include='all')

What we can see from the description (note that `top` and `freq` count rows, not cases):
* the most frequent Country_Region is US, with 895440 rows
* the most frequent Province_State is Texas, with 71400 rows
* the most frequent County is Washington, with 8680 rows
* the most frequent Date is 2020-05-20, with 6926 rows
* the most frequent Target is Fatalities, with 484820 rows
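The difference between "most rows" and "most cases" is easy to miss, so here is a small sketch on a made-up frame. The region names `A` and `B` and all values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame: region A has more rows, region B has the larger total
toy = pd.DataFrame({
    'Country_Region': ['A', 'A', 'A', 'B'],
    'TargetValue': [1, 0, 2, 50],
})

row_counts = toy['Country_Region'].value_counts()              # what describe() reports as freq
value_totals = toy.groupby('Country_Region')['TargetValue'].sum()  # summed target per region

print(row_counts.idxmax())    # region with the most rows
print(value_totals.idxmax())  # region with the largest summed TargetValue
```

To rank regions by actual cases or fatalities, the grouped sum (filtered by `Target`) is the quantity to look at, not the row frequency.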

# 3. Finding missing values

### A. Train data

In [None]:
# count of nulls per column and their share of all rows
missing = df.isnull().sum()
missing_percent = df.isnull().sum()/df.shape[0]*100

dic = {
    'missing':missing,
    'missing_percent %':missing_percent
}
frame=pd.DataFrame(dic)
frame

### B. test data

In [None]:
# same summary for the test set; divide by the test row count, not the train one
missing = test.isnull().sum()
missing_percent = test.isnull().sum()/test.shape[0]*100

dic = {
    'missing':missing,
    'missing_percent %':missing_percent
}
frame=pd.DataFrame(dic)
frame
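Since the same summary is computed for both files, it could be wrapped in a small helper. This is a sketch only; `missing_summary` is a name I made up, and the toy frame below stands in for the real data.

```python
import pandas as pd

def missing_summary(frame):
    """Return null counts and null percentages per column of `frame`."""
    counts = frame.isnull().sum()
    return pd.DataFrame({
        'missing': counts,
        'missing_percent': counts / len(frame) * 100,
    })

# Toy frame with two nulls in column 'a' and none in column 'b'
toy = pd.DataFrame({'a': [1.0, None, 3.0, None], 'b': ['x', 'y', 'z', 'w']})
summary = missing_summary(toy)
print(summary)
```

Using `len(frame)` inside the helper keeps the denominator tied to whichever frame is passed in.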

### C. submission

In [None]:
sub

# Data Visualization

In [None]:
df.hist(figsize=(15,15),edgecolor='black');

## TargetValue

In [None]:
plot(df.TargetValue)

In [None]:
fig = px.pie(df, values='TargetValue', names='Target')
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()

In [None]:
fig = px.pie(df, values='TargetValue', names='Country_Region')
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()

## A. County

In [None]:
plot(df.County)

In [None]:
plt.figure(figsize=(30,9))
county_plot=df.County.value_counts().head(100)
sns.barplot(x=county_plot.index, y=county_plot.values)  # keyword args are required by recent seaborn
plt.xticks(rotation=90)
plt.title('County count');


## B. Province_State

In [None]:
plot(df.Province_State)

In [None]:
plt.figure(figsize=(30,9))
Province_State_plot=df.Province_State.value_counts().head(100)
sns.barplot(x=Province_State_plot.index, y=Province_State_plot.values)  # keyword args are required by recent seaborn
plt.xticks(rotation=90)
plt.title('Province State count');

## C. Country_Region

In [None]:
plot(df.Country_Region)

In [None]:
plt.figure(figsize=(30,9))
Country_Region_plot=df.Country_Region.value_counts().head(30)
sns.barplot(x=Country_Region_plot.index, y=Country_Region_plot.values)  # keyword args are required by recent seaborn
plt.xticks(rotation=90)
plt.title('Country Region count');

In [None]:
confirmed=df[df['Target']=='ConfirmedCases']
fig = px.treemap(confirmed, path=['Country_Region'], values='TargetValue',width=900, height=600)
fig.update_traces(textposition='middle center', textfont_size=15)
fig.update_layout(
    title={
        'text': 'Total Share of Worldwide COVID19 Confirmed Cases',
        'y':0.92,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
dead=df[df['Target']=='Fatalities']
fig = px.treemap(dead, path=['Country_Region'], values='TargetValue',width=900,height=600)
fig.update_traces(textposition='middle center', textfont_size=15)
fig.update_layout(
    title={
        'text': 'Total Share of Worldwide COVID19 Fatalities',
        'y':0.92,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
fig = px.treemap(df, path=['Country_Region'], values='TargetValue',
                  color='Population', hover_data=['Country_Region'],
                  color_continuous_scale='matter', title='Current share of Worldwide COVID19 Confirmed Cases')
fig.show()

In [None]:
df.Population.value_counts()

In [None]:
df.columns

## D. Target

In [None]:
df.Target.value_counts()

In [None]:
df.Target.value_counts().plot.pie(explode=[0.1,0.1],autopct='%1.1f%%',shadow=True)

## Date

In [None]:
last_date = df.Date.max()
df_countries = df[df['Date']==last_date]
df_countries = df_countries.groupby('Country_Region', as_index=False)['TargetValue'].sum()
df_countries = df_countries.nlargest(10,'TargetValue')
df_trend = df.groupby(['Date','Country_Region'], as_index=False)['TargetValue'].sum()
df_trend = df_trend.merge(df_countries, on='Country_Region')
df_trend.rename(columns={'Country_Region':'Country', 'TargetValue_x':'Cases'}, inplace=True)

In [None]:
px.line(df_trend, x='Date', y='Cases', color='Country', title='COVID19 Total Cases growth for top 10 worst affected countries')
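The `TargetValue_x` column renamed above comes from how `merge` handles a column that exists on both sides. Here is a small sketch on made-up data (the dates `d1`, `d2` and regions `A`, `B` are hypothetical) showing where the `_x` and `_y` suffixes appear.

```python
import pandas as pd

daily = pd.DataFrame({
    'Date': ['d1', 'd1', 'd2', 'd2'],
    'Country_Region': ['A', 'B', 'A', 'B'],
    'TargetValue': [1, 5, 2, 7],
})

# Keep the single largest region on the last date
top = daily[daily['Date'] == 'd2'].nlargest(1, 'TargetValue')[['Country_Region', 'TargetValue']]

# Both frames carry TargetValue, so merge adds the default _x / _y suffixes
trend = daily.merge(top, on='Country_Region')
print(trend.columns.tolist())
print(trend)
```

`TargetValue_x` is the per-date value from the left frame, and `TargetValue_y` repeats the last-date total used for ranking.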

# 4. Data Preprocessing

We drop some features that have many null values or are not that important.

In [None]:
df = df.drop(['County','Province_State','Country_Region','Target'],axis=1)
test = test.drop(['County','Province_State','Country_Region','Target'],axis=1)
df

Next, we check whether any null values remain.

In [None]:
df.isnull().sum()

1. First, we create features.
2. Then, we define train_dev_split.

In [None]:
def create_features(df):
    df['day'] = df['Date'].dt.day
    df['month'] = df['Date'].dt.month
    df['dayofweek'] = df['Date'].dt.dayofweek
    df['dayofyear'] = df['Date'].dt.dayofyear
    df['quarter'] = df['Date'].dt.quarter
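    # .dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement
    df['weekofyear'] = df['Date'].dt.isocalendar().week.astype(int)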
    return df

In [None]:
def train_dev_split(df, days):
    # Use the last `days` days as the dev set (expects a datetime 'Date' column)
    date = df['Date'].max() - pd.Timedelta(days=days)
    return df[df['Date'] <= date], df[df['Date'] > date]
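The idea behind a time-based split can be checked on a toy frame. This is a self-contained sketch: `time_based_split` is a hypothetical stand-in that mirrors the function above, and the ten-day frame is made up.

```python
import pandas as pd

def time_based_split(frame, days, date_col='Date'):
    """Split so that the last `days` days form the dev set."""
    cutoff = frame[date_col].max() - pd.Timedelta(days=days)
    return frame[frame[date_col] <= cutoff], frame[frame[date_col] > cutoff]

# Ten consecutive days, one row per day
toy = pd.DataFrame({
    'Date': pd.date_range('2020-05-01', periods=10, freq='D'),
    'TargetValue': range(10),
})

train_part, dev_part = time_based_split(toy, days=3)
print(len(train_part), len(dev_part))
print(dev_part['Date'].min())
```

Unlike a random split, every dev row is later than every train row, which is closer to how a forecast is actually evaluated.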

In [None]:
test_date_min = test['Date'].min()
test_date_max = test['Date'].max()

In [None]:
def avoid_data_leakage(df, date=test_date_min):
    return df[df['Date']<date]

In [None]:
def to_integer(dt_time):
    return 10000*dt_time.year + 100*dt_time.month + dt_time.day
df['Date']=pd.to_datetime(df['Date'])
test['Date']=pd.to_datetime(test['Date'])

In [None]:
# encode dates as YYYYMMDD integers, identically for test and train
test['Date']=test['Date'].dt.strftime("%Y%m%d").astype(int)
df['Date']=df['Date'].dt.strftime("%Y%m%d").astype(int)
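A quick sketch of what this date encoding produces, on three made-up dates. It also shows one property worth keeping in mind: the integers are ordered correctly, but consecutive days are not always one unit apart.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2020-04-30', '2020-05-01', '2020-05-20']))

# Format as YYYYMMDD text, then cast to integer
as_int = dates.dt.strftime('%Y%m%d').astype(int)

print(as_int.tolist())
print(as_int.diff().tolist())  # gap between 30 April and 1 May is 71, not 1
```

Tree-based models only use the ordering of a feature, so this matters less for them than it would for a linear model.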

# Split data

In [None]:

predictors = df.drop(['TargetValue', 'Id'], axis=1)
target = df["TargetValue"]
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size = 0.22, random_state = 0)

# RandomForestRegressor

In [None]:
model = RandomForestRegressor(n_estimators=100, n_jobs=-1)

scores = []

# scale the inputs, then fit the random forest (fitted once)
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('random_forest', model)])
pipeline.fit(X_train, y_train)
prediction = pipeline.predict(X_test)

# Pipeline.score returns the R2 score on the held-out split
scores.append(pipeline.score(X_test, y_test))

In [None]:
plt.figure(figsize=(8,6))
plt.plot(y_test,y_test,color='deeppink')
plt.scatter(y_test,prediction,color='dodgerblue')
plt.xlabel('Actual Target Value',fontsize=15)
plt.ylabel('Predicted Target Value',fontsize=15)
# compute the score instead of hard-coding it in the title
plt.title(f'Random Forest Regressor (R2 Score= {metrics.r2_score(y_test, prediction):.2f})',fontsize=14)
plt.show()
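For readers who want to see the scaler-plus-forest pipeline in isolation, here is a minimal sketch on synthetic data. Everything in it (the three random features, the target formula, the step names `scaler` and `forest`) is an assumption for illustration and not the competition setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with a nonlinear target
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.22, random_state=0)

demo = Pipeline([
    ('scaler', StandardScaler()),
    ('forest', RandomForestRegressor(n_estimators=50, random_state=0)),
])
demo.fit(X_tr, y_tr)
pred = demo.predict(X_te)

# Pipeline.score and metrics.r2_score report the same quantity
r2_from_score = demo.score(X_te, y_te)
r2_from_metric = r2_score(y_te, pred)
print(r2_from_score, r2_from_metric)
```

Fixing `random_state` makes the forest reproducible, so the plotted predictions and the reported score come from the same fitted model.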

In [None]:
X_test

In [None]:
# drop the ForecastId from the test data
test.drop(['ForecastId'],axis=1,inplace=True)
test.index.name = 'Id'
test

In [None]:
y_pred2 = pipeline.predict(X_test)
y_pred2

In [None]:
predictions = pipeline.predict(test)

pred_list = [int(x) for x in predictions]

output = pd.DataFrame({'Id': test.index, 'TargetValue': pred_list})
print(output)

In [None]:
output

# XGBoost Regressor

In [None]:
import xgboost as xgb

In [None]:
# random_state and n_jobs replace the older seed and nthread aliases
xgbr= xgb.XGBRegressor(n_estimators=800, learning_rate=0.01, gamma=0, subsample=.7,
                       colsample_bytree=.7, max_depth=10,
                       min_child_weight=0,
                       objective='reg:squarederror', scale_pos_weight=1,
                       random_state=27, reg_alpha=0.00006, n_jobs=-1)

In [None]:
xgbr.fit(X_train,y_train)

In [None]:
prediction_xgbr=xgbr.predict(X_test)

In [None]:
print('RMSE_XGBoost Regression=', np.sqrt(metrics.mean_squared_error(y_test,prediction_xgbr)))
print('R2 Score_XGBoost Regression=',metrics.r2_score(y_test,prediction_xgbr))
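As a reminder of what the two printed numbers mean, here is a small worked sketch that computes RMSE and R2 by hand with NumPy. The helper names `rmse` and `r2` and the four-point example are mine, chosen only for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(f"RMSE = {rmse(y_true, y_pred):.4f}")
print(f"R2   = {r2(y_true, y_pred):.4f}")
```

RMSE is in the units of the target, while R2 compares the model against always predicting the mean of `y_true`.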

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(x=y_test, y=prediction_xgbr, color='dodgerblue')
plt.plot(y_test,y_test, color='deeppink')
plt.xlabel('Actual Target Value',fontsize=15)
plt.ylabel('Predicted Target Value',fontsize=15)
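# compute the score instead of hard-coding it in the title
plt.title(f'XGBoost Regressor (R2 Score= {metrics.r2_score(y_test, prediction_xgbr):.2f})',fontsize=14)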
plt.show()

# Submission

In [None]:
a=output.groupby(['Id'])['TargetValue'].quantile(q=0.05).reset_index()
b=output.groupby(['Id'])['TargetValue'].quantile(q=0.5).reset_index()
c=output.groupby(['Id'])['TargetValue'].quantile(q=0.95).reset_index()
a.columns=['Id','q0.05']
b.columns=['Id','q0.5']
c.columns=['Id','q0.95']
a=pd.concat([a,b['q0.5'],c['q0.95']],axis=1)  # axis must be passed by keyword in pandas 2.x
a['q0.05']=a['q0.05'].clip(0,10000)
a['q0.5']=a['q0.5'].clip(0,10000)
a['q0.95']=a['q0.95'].clip(0,10000)
a['Id'] =a['Id']+ 1
a
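One thing worth checking here: if each `Id` carries a single prediction, as appears to be the case for `output`, then a per-`Id` quantile simply returns that prediction, whatever `q` is. The toy frame below is made up to illustrate this.

```python
import pandas as pd

# One prediction per Id, mimicking the shape of `output`
toy_output = pd.DataFrame({'Id': [0, 1, 2], 'TargetValue': [4, 10, 0]})

grouped = toy_output.groupby('Id')['TargetValue']
q05 = grouped.quantile(q=0.05)
q50 = grouped.quantile(q=0.5)
q95 = grouped.quantile(q=0.95)

# With one row per group, all three quantiles coincide
print(pd.concat([q05, q50, q95], axis=1, keys=['q0.05', 'q0.5', 'q0.95']))
```

Getting distinct 5%, 50% and 95% values would need several predictions per `Id` (for example from the individual trees) or a model trained for each quantile.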

In [None]:
sub=pd.melt(a, id_vars=['Id'], value_vars=['q0.05','q0.5','q0.95'])
sub['variable']=sub['variable'].str.replace("q","", regex=False)
sub['ForecastId_Quantile']=sub['Id'].astype(str)+'_'+sub['variable']
sub['TargetValue']=sub['value']
sub=sub[['ForecastId_Quantile','TargetValue']]
sub.reset_index(drop=True,inplace=True)
sub.head()
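The reshaping into `ForecastId_Quantile` keys is easier to follow on a tiny example. The two-row frame below is made up; the column names and the `melt` call follow the cell above.

```python
import pandas as pd

wide = pd.DataFrame({
    'Id': [1, 2],
    'q0.05': [0.0, 1.0],
    'q0.5': [2.0, 3.0],
    'q0.95': [5.0, 8.0],
})

# One row per (Id, quantile) pair; melt stacks the quantile columns in order
long = pd.melt(wide, id_vars=['Id'], value_vars=['q0.05', 'q0.5', 'q0.95'])

# Build keys such as "1_0.05" by stripping the leading "q"
long['ForecastId_Quantile'] = (
    long['Id'].astype(str) + '_' + long['variable'].str.replace('q', '', regex=False)
)
print(long[['ForecastId_Quantile', 'value']])
```

Each `Id` therefore contributes three rows to the submission file, one per quantile.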

In [None]:
sub.to_csv("submission.csv",index=False)