# Day level information on covid-19 affected cases

> **Task I**: Predict the spread of the coronavirus
- Can we help mitigate the secondary effects of COVID-19 by predicting its spread?

We are going to answer this question.

## Part I

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# import all the packages we need
import matplotlib.pyplot as plt
import seaborn as sns 
import statsmodels as sm
import folium as fl
from pathlib import Path
from sklearn.impute import SimpleImputer
sns.set()
%matplotlib inline
pd.options.plotting.backend
pd.plotting.register_matplotlib_converters()

## Data cleaning
In this part of the notebook, I clean two CSV files: covid_19_data.csv and COVID19_open_line_list.csv.

**Clean data from covid_19_data.csv file**

In [None]:
dataFile = '/kaggle/input/novel-corona-virus-2019-dataset/covid_19_data.csv'
covid = pd.read_csv(dataFile)

In [None]:
#see data 
covid.head()

In [None]:
# data information
covid.info()

In [None]:
# check whether there are missing values
mis = covid.isnull().sum()
mis[mis>0]

Only Province/State has missing values. I impute it rather than drop it because this variable is needed for the visualizations.
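As an alternative sketch (not the approach used in the next cell), the gaps can be filled on that single column with `fillna`, which leaves the dtypes of the other columns untouched. The toy frame and the placeholder string `'Unknown'` are assumptions made for illustration only.

```python
import pandas as pd

# Toy frame standing in for covid_19_data.csv (values are made up).
toy = pd.DataFrame({
    'Province/State': ['Hubei', None, 'Ontario'],
    'Country/Region': ['Mainland China', 'France', 'Canada'],
    'Confirmed': [100.0, 5.0, 2.0],
})

# Fill only the column that has gaps; 'Unknown' is a hypothetical placeholder.
filled = toy.copy()
filled['Province/State'] = filled['Province/State'].fillna('Unknown')

print(filled)
```

Because only one column is touched, the numeric columns do not need to be converted back afterwards.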

In [None]:
# Use a constant fill value: we cannot invent a Province/State that is unknown
# or that does not belong to the row's Country/Region.
imputer = SimpleImputer(strategy='constant')
impute_covid = pd.DataFrame(imputer.fit_transform(covid), columns=covid.columns)
impute_covid.head()

In [None]:
#convert ObservationDate and Last Update object to datetime
#convert confirmed, recovered, death to numeric
impute_covid['ObservationDate'] = pd.to_datetime(impute_covid['ObservationDate'])
impute_covid['Last Update'] = pd.to_datetime(impute_covid['Last Update'])
impute_covid['Confirmed'] = pd.to_numeric(impute_covid['Confirmed'], errors='coerce')
impute_covid['Recovered'] = pd.to_numeric(impute_covid['Recovered'], errors='coerce')
impute_covid['Deaths'] = pd.to_numeric(impute_covid['Deaths'], errors='coerce')

In [None]:
#check
#impute_covid.to_csv('/kaggle/input/novel-corona-virus-2019-dataset/covid_19_data_clean.csv', index=False)
impute_covid.info()

**clean data from file COVID19_open_line_list.csv**

In [None]:
filename = '/kaggle/input/novel-corona-virus-2019-dataset/COVID19_open_line_list.csv'
open_covid = pd.read_csv(filename)

In [None]:
open_covid.head(3)

In [None]:
open_covid.info()

In [None]:
# check missing value 
mis_value = open_covid.isnull().sum()
mis_value[mis_value>0]

In [None]:
# Use the number of missing values in date_confirmation as the threshold,
# so that we do not lose too many columns.
thresh_value = open_covid['date_confirmation'].isnull().sum()
# Keep only the columns whose number of missing values is at most thresh_value.
cols_interest_missing = [col for col in open_covid.columns if open_covid[col].isnull().sum()<=thresh_value]
cols_interest_missing

In [None]:
missing_covid = open_covid[cols_interest_missing].copy()

In [None]:
# drop the rows that still contain NaN values
drop_covid = missing_covid.dropna(axis=0)
drop_covid.isnull().any()

In [None]:
# look at the data after dropping
drop_covid.head()

In [None]:
# some anomalies detected in the data 
drop_covid['date_confirmation'][drop_covid['date_confirmation']=='25.02.2020-26.02.2020']

In [None]:
drop_covid['date_confirmation'][drop_covid['date_confirmation']!='25.02.2020-26.02.2020']

The date_confirmation column contains an anomalous value: a date range ('25.02.2020-26.02.2020') instead of a single date. To decide how to fix it, we look at the format of date_confirmation in the neighbouring rows, around 11586 to 11810.
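One way to handle such range strings in general is to keep only the first day of the range before parsing. The sketch below is illustrative: the helper name `first_day` is hypothetical, and it assumes every value is either `dd.mm.yyyy` or `dd.mm.yyyy-dd.mm.yyyy`.

```python
from datetime import datetime

def first_day(raw, fmt="%d.%m.%Y"):
    """Parse a 'dd.mm.yyyy' string; for a 'start-end' range keep the start day."""
    start = raw.split('-')[0].strip()
    return datetime.strptime(start, fmt)

print(first_day('25.02.2020-26.02.2020'))
print(first_day('14.02.2020'))
```

Keeping the start of the range matches the choice made in this notebook, where the range is replaced by '25.02.2020'.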

In [None]:
#verification 
drop_covid['date_confirmation'].iloc[11630:11750].values

In [None]:
# fix the problem by replacing '25.02.2020-26.02.2020' with '25.02.2020'
drop_covid = drop_covid.replace(to_replace='25.02.2020-26.02.2020', value='25.02.2020')

In [None]:
# convert date_confirmation to datetime dtype
# (plain column assignment, so the column takes the new datetime64 dtype)
drop_covid['date_confirmation'] = pd.to_datetime(drop_covid['date_confirmation'], format="%d.%m.%Y")

In [None]:
# check that no missing values remain
drop_covid.isnull().any()

In [None]:
#drop_covid.to_csv('/kaggle/input/novel-corona-virus-2019-dataset/COVID19_open_clean.csv', index=False)
drop_covid.head()

**Data cleaning is finished; we now move on to visualizing the data.**

## Feature Statistics and Visualization

> **impute_covid**:
we visualize this data and compute some statistics to find the relevant information.

In [None]:
# see again data table
impute_covid.head(3)

In [None]:
# compute the active cases: confirmed minus (deaths + recovered)
impute_covid['active_confirmed'] = impute_covid['Confirmed'].values - \
(impute_covid['Deaths'].values+impute_covid['Recovered'].values)

In [None]:
#check if all is ok
impute_covid.isnull().sum()[impute_covid.isnull().sum()>0]

In [None]:
# no missing values; look at the column types
impute_covid.info()

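**Correlation between features**

In [None]:
# restrict to the numeric columns so the call also works on recent pandas versions
impute_covid[['Confirmed', 'Deaths', 'Recovered', 'active_confirmed']].corr()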

We see that:
- Confirmed and Deaths are the most strongly correlated
- Recovered and Deaths are strongly correlated
- Confirmed and Recovered are somewhat less correlated
- active_confirmed and Confirmed are strongly correlated

We can check this in the figure below.

In [None]:
features = [['Confirmed', 'Deaths'], ['Confirmed', 'Recovered'], ['Recovered', 'Deaths'], \
            ['Confirmed', 'active_confirmed']]
values = [[impute_covid['Confirmed'], impute_covid['Deaths']],\
          [impute_covid['Confirmed'], impute_covid['Recovered']],\
          [impute_covid['Recovered'], impute_covid['Deaths']],\
          [impute_covid['Confirmed'], impute_covid['active_confirmed']]]

In [None]:
fig = plt.figure(figsize=(20.5,10.5))
fig.subplots_adjust(hspace=0.2, wspace=0.1)
for i in range(1,5):
    ax = fig.add_subplot(2, 2, i)
    col = features[i-1]
    val = values[i-1]
    ax.scatter(val[0], val[1])
    ax.set_xlabel(col[0])
    ax.set_ylabel(col[1])
    ax.set_title('Feature curves')
plt.show()

This figure is consistent with the correlation table above, and it suggests the following approximation:
- Confirmed feature ---> X
- Deaths feature ---> Y
- Recovered ---> Z

> We can write
- Y = f(X)
- Z = g(Y)

so that
> - Z = g(f(X))

and finally

> - Z = (g∘f)(X). This composition is why the correlation between X and Z is 70.70%


**The Confirmed feature is extremely important, and it varies over time. The next steps are:**
- find how the Confirmed feature behaves over time
- find how the Confirmed feature behaves across locations.



### Worldwide confirmed, recovered, death and active cases

**Location: day-level information on novel COVID-19**

In [None]:
start_date = impute_covid.ObservationDate.min()
end_date = impute_covid.ObservationDate.max()
print('Novel Covid-19 information:\n 1. Start date = {}\n 2. End date = {}'.format(start_date, end_date))

In [None]:
worldwide = impute_covid[impute_covid['ObservationDate'] == end_date]

In [None]:
nb_country = len(worldwide['Country/Region'].value_counts()) # number of countries
worldwide['Country/Region'].value_counts()

In [None]:
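# sum only the count columns (dates and names are not meaningful to add up)
world = worldwide.groupby('Country/Region')[['Confirmed', 'Deaths', 'Recovered', 'active_confirmed']].sum()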
world = world.sort_values(by=['Confirmed'], ascending=False)
world.head()

In [None]:
print('================ Worldwide report ===============================')
print('== Information as of {} on novel COVID-19 =========\n'.format(end_date))
print('Total confirmed: {}\nTotal Deaths: {}\nTotal Recovered: {}\nTotal active confirmed: {}\n\
Total countries recorded: {} \n'.format(\
worldwide.Confirmed.sum(), worldwide.Deaths.sum(), worldwide.Recovered.sum(), worldwide.active_confirmed.sum(),\
                                     nb_country))
print('==================================================================')

In [None]:
world.Confirmed.plot(kind='bar', title='Novel COVID-19 worldwide', figsize=(20,8), logy=True,legend=True)
plt.ylabel('Total Cases')

In [None]:
world.Recovered.plot(kind='bar', title='Novel COVID-19 worldwide', figsize=(20,8), logy=True,\
                     colormap='Greens_r', legend=True)
plt.ylabel('Total Recovered')

In [None]:
world.Deaths.plot(kind='bar', title='Novel COVID-19 worldwide', figsize=(20,8), logy=True,\
                     colormap='Reds_r', legend=True)
plt.ylabel('Total Deaths')

In [None]:
world.active_confirmed.plot(kind='bar', title='Novel COVID-19 worldwide', figsize=(20,8), logy=True,\
                            legend=True)
plt.ylabel('Total Active Cases')

**In this part I plot the most affected countries on a graph and list the other countries in a table**

In [None]:
world_table = world.reset_index()

In [None]:
# use France's confirmed count as the cut-off for the most affected countries
x = world_table[world_table['Country/Region'] == 'France']
big_7 = world_table[world_table['Confirmed'] >= x['Confirmed'].iloc[0]]

**The seven countries most affected by novel COVID-19**

In [None]:
big_7.style.background_gradient(cmap='viridis')

In [None]:
axs = big_7.plot('Country/Region', ['Confirmed', 'Deaths', 'Recovered', 'active_confirmed'], kind='barh',\
                 stacked=True, title='Countries most affected by novel COVID-19',\
                 figsize=(20,10.5),colormap='rainbow_r', logx=True, legend=True) 
# `world` is indexed by country, so its shape matches the row and column labels
pd.plotting.table(ax=axs, data=world, rowLabels=world.index, colLabels=world.columns)
plt.xlabel(' ')

**Time**  

In [None]:
time_obs = impute_covid.groupby('ObservationDate')['Confirmed'].aggregate([np.sum])
time_obs.columns = ['Confirmed']

In [None]:
time_obs.plot(figsize=(20,8), title='Novel COVID-19 worldwide', kind='bar')
plt.ylabel('Total confirmed cases')

In [None]:
death_rate = impute_covid.groupby('ObservationDate')['Deaths'].aggregate([np.sum])
recovered_rate = impute_covid.groupby('ObservationDate')['Recovered'].aggregate([np.sum])
activecase_rate = impute_covid.groupby('ObservationDate')['active_confirmed'].aggregate([np.sum])
# these are daily totals, not rates, so label them accordingly
death_rate.columns = ['Deaths']
recovered_rate.columns = ['Recovered']
activecase_rate.columns = ['Active cases']

In [None]:
recovered_rate.plot(figsize=(15.5, 5), title='Novel COVID-19 worldwide', colormap='Greens_r', kind='bar')
plt.ylabel('Total patients')

In [None]:
death_rate.plot(figsize=(15.5, 5), title='Novel COVID-19 worldwide', colormap='Reds_r', kind='bar')
plt.ylabel('Total patients')

In [None]:
activecase_rate.plot(figsize=(15.5, 5), title='Novel COVID-19 worldwide', colormap='Blues_r', kind='bar')
plt.ylabel('Total patients')

## Special case: China

COVID-19 was first reported in this country, which is why I pay particular attention to how the disease behaves there.

In [None]:
china = impute_covid[impute_covid['Country/Region'] == 'Mainland China']

In [None]:
chstar_date = china.ObservationDate.min()
chend_date = china.ObservationDate.max()

In [None]:
print('Novel covid-19 China:\n start date = {}\n end date = {}'.format(chstar_date, chend_date))

In [None]:
lastChina = china[china['ObservationDate'] == chend_date]
lastChina.head()

In [None]:
print('================ China report ===================================')
print('== Information as of {} on novel COVID-19 =========\n'.format(chend_date))
print('Total confirmed: {}\nTotal Deaths: {}\nTotal Recovered: {}\nTotal active confirmed: {}\n'.format(\
lastChina.Confirmed.sum(), lastChina.Deaths.sum(), lastChina.Recovered.sum(), lastChina.active_confirmed.sum()))
print('==================================================================')

In [None]:
lastChina[['Province/State', 'Confirmed', 'Deaths', 'Recovered', 'active_confirmed']].style.\
background_gradient(cmap='viridis')

**COVID-19 by province**
> - confirmed patients
> - recovered patients
> - deaths

In [None]:
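# sum only the count columns (dates and names are not meaningful to add up)
province = lastChina.groupby('Province/State')[['Confirmed', 'Deaths', 'Recovered', 'active_confirmed']].sum()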
province = province.sort_values(by=['Confirmed'], ascending=False)

In [None]:
province.plot(kind='bar', label='Confirmed',logy=True,figsize=(20,10), stacked=True,\
              title='Chinese provinces with novel COVID-19')
plt.ylabel('Total patients')

In [None]:
conf_china = china.groupby('ObservationDate')['Confirmed'].agg('sum')
rec_china = china.groupby('ObservationDate')['Recovered'].agg('sum')
dea_china = china.groupby('ObservationDate')['Deaths'].agg('sum')
ac_china = china.groupby('ObservationDate')['active_confirmed'].agg('sum')

In [None]:
conf_china.plot(figsize=(20,8), kind='bar',title='Confirmed patients in China by observation date')
plt.ylabel('Total patients')

In [None]:
rec_china.plot(figsize=(20,8), kind='bar',title='Recovered patients in China by observation date',\
               colormap='Greens_r')
plt.ylabel('Total patients')

In [None]:
dea_china.plot(figsize=(20,8), kind='bar',title='Deaths in China by observation date', colormap='Reds_r')
plt.ylabel('Total patients')

In [None]:
ac_china.plot(figsize=(20,8), kind='bar',title='Active confirmed patients in China by observation date')
plt.ylabel('Total patients')

## Rest of the world

We now look at how COVID-19 behaves in the rest of the world.

In [None]:
rest_world = impute_covid[impute_covid['Country/Region'] != 'Mainland China']

In [None]:
rest_world.head()

In [None]:
print('Novel covid-19 ROW:\n start date = {}\n end date = {}'.format(rest_world.ObservationDate.min(),\
                    rest_world.ObservationDate.max()))

In [None]:
row = rest_world[rest_world['ObservationDate'] == rest_world.ObservationDate.max()]

In [None]:
print('================ ROW report =====================================')
print('== Information as of {} on novel COVID-19 =========\n'.format(rest_world.ObservationDate.max()))
print('Total confirmed: {}\nTotal Deaths: {}\nTotal Recovered: {}\nTotal active confirmed: {}\n'.format(\
row.Confirmed.sum(), row.Deaths.sum(), row.Recovered.sum(), row.active_confirmed.sum()))
print('==================================================================')

In [None]:
rw = row[['Country/Region', 'Confirmed', 'Deaths', 'Recovered', 'active_confirmed']].\
groupby('Country/Region').sum()
rwx = rw.sort_values(by=['Confirmed'], ascending=False)
rwx.style.background_gradient(cmap='viridis')

In [None]:
rwx.plot(kind='bar', figsize=(20,10), stacked=True, title='Novel COVID-19 in the rest of the world', logy=True)
plt.ylabel('Total patients')

In [None]:
obs_conf_world = rest_world.groupby('ObservationDate')['Confirmed'].aggregate([np.sum]) # confirmed per day
ac_conf_world = rest_world.groupby('ObservationDate')['active_confirmed'].aggregate([np.sum]) # active cases per day
patient_world_r = rest_world.groupby('ObservationDate')['Recovered'].aggregate([np.sum]) # recovered per day
patient_world_dea = rest_world.groupby('ObservationDate')['Deaths'].aggregate([np.sum]) # deaths per day

In [None]:
obs_conf_world.columns = ['Confirmed']
ac_conf_world.columns = ['active_confirmed']
patient_world_r.columns = ['Recovered'] 
patient_world_dea.columns = ['Deaths'] 

In [None]:
obs_conf_world.plot(figsize=(20,8), title='novel covid-19 in the rest of the world',kind='bar')
plt.ylabel('total patient')

In [None]:
ac_conf_world.plot(figsize=(20,8), title="novel covid-19 in the rest of the world", kind='bar')
plt.ylabel('total patient')

In [None]:
patient_world_r.plot(figsize=(20,10.5), title='novel covid-19 in the rest of the world', kind='bar', \
                     colormap='Greens_r')
plt.ylabel('total patient')

In [None]:
patient_world_dea.plot(figsize=(20,10.5), title='novel covid-19 in the rest of the world', kind='bar', \
                     colormap='Reds_r')
plt.ylabel('total patient')

**Conclusion of Part I**

From this work, we note that:
- Confirmed, Recovered and Deaths are pairwise correlated, so we can make the approximation
> - **Recovered = (g∘f)(Confirmed)**, where **Deaths = f(Confirmed)** and **Recovered = g(Deaths)**

So Confirmed is an important feature in this data, and we can build a model based on that feature alone. The Confirmed feature depends on time.

We have seen qualitatively how COVID-19 is spreading in the world. In the next part (Part II), we look for a model that predicts the spread of COVID-19 over time.

## Part II
### Model I: Deep learning and time series for predicting the spread of novel COVID-19 worldwide

In [None]:
from sklearn.preprocessing import MinMaxScaler
import datetime
from keras.layers import GRU
from keras.layers import Dense, Input, Dropout 
from keras.optimizers import Adam, RMSprop
from keras.models import Model 
from keras.models import load_model 
from keras.callbacks import ModelCheckpoint

In [None]:
confirmed_case = impute_covid[['ObservationDate', 'Confirmed']]
confirmed_case = confirmed_case.set_index('ObservationDate')

In [None]:
confirmed_case.head(3)

In [None]:
# violin plot
sns.violinplot(x=confirmed_case.Confirmed)
plt.title('Confirmed cases violin plot')

In [None]:
scaler = MinMaxScaler(feature_range=(0,1))
confirmed_case['scaled_cases']= scaler.fit_transform(np.array(confirmed_case.Confirmed).reshape(-1,1))

In [None]:
confirmed_case.head()

**ACF and PACF for confirmed feature**

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

In [None]:
fig = plt.figure(figsize=(15, 5.5))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
ax1  = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
plot_acf(confirmed_case.scaled_cases, ax=ax1, lags=10)
plot_pacf(confirmed_case.scaled_cases, ax=ax2, lags=10)
plt.show()

The ACF and PACF plots look alike; from them we read an autoregressive order of p = 3.

In [None]:
# split the data into a train set and a test set (the last 3 days form the test set)
split_date = end_date - datetime.timedelta(days=3) + datetime.timedelta(hours=23, minutes=59,seconds=59)
train = confirmed_case[confirmed_case.index <= split_date]
test = confirmed_case[confirmed_case.index > split_date]

In [None]:
print('train shape: {}\ntest shape : {}'.format(train.shape, test.shape))

In [None]:
def makeXy(ts, nb_timesteps): 
    ''' 
    Input:  
           ts: original time series 
           nb_timesteps: number of time steps in the regressors 
    Output:  
           X: 2-D array of regressors 
           y: 1-D array of target  
    ''' 
    X = [] 
    y = [] 
    for i in range(nb_timesteps, ts.shape[0]): 
        
        X.append(list(ts.iloc[i-nb_timesteps:i])) 
        y.append(ts.iloc[i]) 
    X, y = np.array(X), np.array(y) 
    return X, y 

`lookback`: how many time steps back the input data should go.
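To make the role of `lookback` concrete, here is a minimal sketch of the same windowing idea on a made-up series. The helper name `make_windows` is hypothetical; it mirrors what `makeXy` does, using NumPy arrays instead of a pandas Series.

```python
import numpy as np

def make_windows(series, lookback):
    """Return (X, y): X[i] holds `lookback` consecutive values, y[i] the next one."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i - lookback:i] for i in range(lookback, len(series))])
    y = series[lookback:]
    return X, y

X, y = make_windows([10, 20, 30, 40, 50], lookback=2)
print(X)  # rows: [10, 20], [20, 30], [30, 40]
print(y)  # 30, 40, 50
```

Each row of `X` is one training example, so a series of length 5 with `lookback=2` yields 3 examples.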

In [None]:
lookback = 2 # use the 2 previous observations as input

X_train, y_train = makeXy(train['scaled_cases'], lookback) 
print('Shape of train arrays:', X_train.shape, y_train.shape) 

X_test, y_test = makeXy(test['scaled_cases'], lookback) 
print('Shape of test arrays:', X_test.shape, y_test.shape) 

In [None]:
Xtrain = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))

Xtest = X_test.reshape((X_test.shape[0], X_test.shape[1], 1)) 

n=  Xtrain.shape[1]
print('Shape of 3D arrays:', Xtrain.shape, Xtest.shape)

In [None]:
# fix random seed for reproducibility
np.random.seed(7)

In [None]:
# Define the input layer with shape (None, n, 1) and type float32; None stands for the number of instances
input_layer = Input(shape=(n,1), dtype='float32')

In [None]:
gru_layer1 = GRU(64, input_shape=(n,1), return_sequences=True)(input_layer)
gru_layer2 = GRU(32, input_shape=(n,64), return_sequences=False)(gru_layer1)

In [None]:
dropout_layer = Dropout(0.2)(gru_layer2)

In [None]:
#Finally the output layer gives prediction for the next day's confirmed case.
output_layer = Dense(1, activation='linear')(dropout_layer)

In [None]:
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(loss='mean_squared_error', optimizer='rmsprop')
model.summary()

In [None]:
history = model.fit(x=Xtrain, y=y_train, batch_size=32, epochs=20,verbose=1, validation_data=(Xtest, y_test))

In [None]:
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:
# compute predictions and map them back to the original scale
preds = model.predict(Xtest)
pred_covid19 = scaler.inverse_transform(preds)
pred_covid19 = np.squeeze(pred_covid19)

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
#compute score
rmse = np.sqrt(mean_squared_error(test.Confirmed.iloc[n:] , pred_covid19))
print('RMSE for the test set:', round(rmse, 4))

In [None]:
actual_pred = pd.DataFrame()
actual_pred['actual'] = test.Confirmed.iloc[n:]
actual_pred['predict'] =  pred_covid19

In [None]:
valid = actual_pred.reset_index()

In [None]:
valid.head()


In [None]:
valid.groupby('ObservationDate').sum().plot(kind='bar')
plt.ylabel(' Total confirmed')

**Source** https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/

**Source** *Avishek Pal, PKS Prakash, Practical Time Series Analysis: Master Time Series Data Processing, Visualization, and Modeling using Python, Packt Publishing (2017)*

**Source** *François Chollet, Deep Learning with Python*

## Model II: Confirmed case behavior over time using growth rate

In this part of the notebook, we study how the confirmed cases behave over time. To do that, we approximate the data as follows:
> $\dfrac{d\,\text{case}}{\text{case}} = \alpha(t)$

With $d\,\text{case} = \text{case}(t+\tau) - \text{case}(t)$, we have:

> $\alpha(t) = \dfrac{\text{case}(t+\tau) - \text{case}(t)}{\text{case}(t)} = \dfrac{\text{case}(t+\tau)}{\text{case}(t)} - 1$

Here $\dfrac{\text{case}(t+\tau)}{\text{case}(t)}$ is the **growth** of the cases, where $\text{case}(t)$ is the number of cases on the past day $t$, $\text{case}(t+\tau)$ is the number of cases on the present day $t+\tau$, and $\tau$ is the time between the two dates.

For observation times $t_{i}$, $i=1,2,\dots,N$, the growth rate $\alpha$ is:
> $\alpha_{\tau} = \dfrac{\text{case}(t_{i}+\tau)}{\text{case}(t_{i})} - 1,\qquad i = 1, \dots, N$.
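As a quick worked example of this formula with $\tau$ equal to one observation step, the sketch below computes the growth rate on made-up cumulative counts. The helper name `growth_rates` and the numbers are illustrative assumptions.

```python
def growth_rates(cases):
    """alpha_i = case(t_{i+1}) / case(t_i) - 1 for consecutive observations."""
    return [nxt / cur - 1 for cur, nxt in zip(cases, cases[1:])]

toy_cases = [100, 120, 150]       # made-up cumulative counts
rates = growth_rates(toy_cases)
print(rates)                      # approximately [0.2, 0.25]
```

Going from 100 to 120 cases is a growth rate of about 0.2 (20%), and from 120 to 150 it is 0.25 (25%).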

We start. 

In [None]:
# we reuse time_obs from the code above
time_obs.head()

In [None]:
# day-over-day growth rate; the first day has no previous value, so it is set to 0
x = []
x.append(0)
for i in range(time_obs.shape[0]-1):
    a = time_obs.iloc[i+1,0]-time_obs.iloc[i,0]
    x.append(a/time_obs.iloc[i,0])

In [None]:
grown_rate = time_obs.reset_index()
grown_rate['grownRate'] = x
grown_rate.head()

In [None]:
grown_rate.grownRate.plot(figsize=(10,5))
plt.title('Confirmed case growth rate')
plt.ylabel('Growth rate')
plt.xlabel('Day index')

We have computed the growth rate of confirmed cases. We now fit a LinearRegression model on PolynomialFeatures of the time index, so that the model can capture a nonlinear relationship.
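The idea of a polynomial trend can be sketched on synthetic data as follows. This uses `numpy.polyfit` as a stand-in for the scikit-learn pipeline (both solve an ordinary least-squares problem on powers of the time index); the degree 2 and the data are assumptions chosen so the result is easy to check.

```python
import numpy as np

t = np.arange(10, dtype=float)            # time index 0..9
y = 0.5 - 0.1 * t + 0.01 * t**2           # made-up noiseless quadratic "growth rate"

coeffs = np.polyfit(t, y, deg=2)          # coefficients, highest power first
trend = np.polyval(coeffs, t)             # fitted values on the training index
next_value = np.polyval(coeffs, 10.0)     # extrapolate one step ahead

print(coeffs)
print(next_value)
```

Keep in mind that high-degree polynomials can behave erratically outside the fitted range, so extrapolated values deserve caution.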

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor # for next model below
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

In [None]:
# LinearRegression(normalize=True) was removed in recent scikit-learn releases;
# scaling the polynomial features in the pipeline keeps the fit well conditioned
trend_model = make_pipeline(PolynomialFeatures(8), StandardScaler(), LinearRegression(fit_intercept=True))
trend_model.fit(np.array(grown_rate.index).reshape((-1,1)), grown_rate['grownRate'])

In [None]:
# the last pipeline step is the regression; its coefficients refer to the scaled features
print('Trend model coefficient={} and intercept={}'.format(trend_model[-1].coef_[0],trend_model[-1].intercept_))

In [None]:
dt =np.array(grown_rate.index).reshape((-1,1)) 
fit_grown = trend_model.predict(dt)

In [None]:
errors = grown_rate['grownRate'] - fit_grown

In [None]:
plt.figure(figsize=(10,5))
plt.scatter(dt.ravel(), grown_rate['grownRate'])
# error bar lengths must be non-negative, so use the absolute residuals
plt.errorbar(dt.ravel(), fit_grown, yerr=np.abs(errors), color='r', label='prediction and errors')
plt.legend(loc='best')
plt.show()

In [None]:
#we compute score
print("Mean Absolute Error : " + str(mean_absolute_error(fit_grown, grown_rate['grownRate'])))

In [None]:
#score
trend_model.score(dt, grown_rate['grownRate'])

In [None]:
from datetime import timedelta
next_date = str(end_date+timedelta(days=1))
new_date = pd.date_range(start=next_date, periods=3)
ndt = np.arange(len(new_date)) +len(time_obs)
print('new date {} correspond to new dt {}'.format(new_date, ndt))

In [None]:
# compute the forecast growth rates
new_rate = trend_model.predict(ndt.reshape((-1,1)))
print('Rate forecast: {}:'.format(new_rate))

In [None]:
pred_rate = pd.DataFrame()
rate = grown_rate.set_index('ObservationDate')
pred_rate['prediction_grownthRate'] = new_rate 
pred_rate.index=new_date

In [None]:
#we concatenate the two data
data_plot = pd.concat([rate, pred_rate], sort=False)
data_plot.head(1)

In [None]:
data_plot[['grownRate', 'prediction_grownthRate']].plot(figsize=(10,5))
plt.ylabel('growth rate')
plt.title('growth rate forecast')

**Course recap**
## Model for time series
The purpose of time series analysis is to develop a mathematical model that can explain the observed behavior of a time series and possibly forecast its future state.

The different models for time series analysis are:

1- **zero mean model**

2- **random walk**

3- **trend model**

4- **seasonality model**

The generic four-step approach to time series analysis is as follows:

1- *visualize the data at different granularities of the time index to reveal long-run trends and seasonal fluctuations.*

2- *fit a trend line to capture long-run trends and plot the residuals to check for seasonality or irreducible error.*

3- *fit a harmonic regression model to capture seasonality.*

4- *plot the residuals left by the seasonality model to check for irreducible error.*

Extracted from **Avishek Pal, PKS Prakash, Practical Time Series Analysis: Master Time Series Data Processing, Visualization, and Modeling using Python, Packt Publishing (2017)**
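Step 3 above (harmonic regression) can be sketched as an ordinary least-squares fit on sine and cosine terms of a known period. The example below is illustrative only: the series is synthetic and the weekly period of 7 is an assumption.

```python
import numpy as np

period = 7                                     # assumed weekly cycle
t = np.arange(28, dtype=float)
# synthetic seasonal series with known amplitudes 2.0 (sine) and 0.5 (cosine)
y = 2.0 * np.sin(2 * np.pi * t / period) + 0.5 * np.cos(2 * np.pi * t / period)

# Design matrix: intercept, sine and cosine at the known period.
A = np.column_stack([
    np.ones_like(t),
    np.sin(2 * np.pi * t / period),
    np.cos(2 * np.pi * t / period),
])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
seasonal_fit = A @ beta
residual_after = y - seasonal_fit              # what step 4 would inspect

print(beta)
```

On real data the residual left after this fit is what step 4 examines for remaining structure.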

From this short recap, we have already done step 1 and half of step 2.

Next, we plot the residuals to see whether they show seasonality or only irreducible error.

In [None]:
residual = pd.Series(data=errors, index=grown_rate.index)

In [None]:
plt.figure(figsize=(8,8))
residual.plot()
plt.xlabel('time index')
plt.ylabel('Residual')
plt.title('Residual between actual and prediction')

Does this plot show seasonality or only irreducible error? Let's find out.

**Zero mean model**

The zero mean model has a constant mean and constant variance and shows no predictable trend or seasonality. Observations from a zero mean model are assumed to be independent and identically distributed (i.i.d.) and represent random noise around a fixed mean, which has been deducted from the time series as a constant term.

**Seasonality model**

Seasonality manifests as periodic and repetitive fluctuations in a time series and is hence modeled as a weighted sum of sine waves of known periodicity.

To know whether the residuals look like one of the two models above, we plot their ACF and PACF.
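To see why the ACF helps here, the sketch below computes a sample autocorrelation by hand on a synthetic series with a 7-step cycle. The helper name `sample_acf` is hypothetical; the notebook itself relies on `plot_acf` from statsmodels.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation r_k = sum((x_t - m)(x_{t+k} - m)) / sum((x_t - m)^2)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.sum(d ** 2)
    return np.array([np.sum(d[:len(x) - k] * d[k:]) / denom for k in range(max_lag + 1)])

t = np.arange(70)
weekly = np.sin(2 * np.pi * t / 7)      # synthetic series with a 7-step cycle
acf = sample_acf(weekly, max_lag=10)
print(np.round(acf, 2))
```

A seasonal series shows a strong positive peak at the seasonal lag (here lag 7), whereas i.i.d. noise would give values close to zero at every lag above 0.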


In [None]:
fig = plt.figure(figsize=(15, 5.5))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
ax1  = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
plot_acf(residual, ax=ax1, lags=20)
plot_pacf(residual, ax=ax2, lags=20)
plt.show()

I think there is seasonality; we check that with a decomposition. From the plots we read MA = 1 and p = 0.

In [None]:
from statsmodels.tsa import seasonal
residual_decompose = seasonal.seasonal_decompose(residual.tolist(), model='additive',period=7)

In [None]:
_=residual_decompose.plot()

We find that the residuals still contain trend and seasonal components. So we cannot keep this model; we need another model better suited to this problem.

## Model III: Growth rate with an autoregressive model

In [None]:
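# install the pmdarima package
!pip install pmdarima

In [None]:
from statsmodels.graphics.gofplots import qqplot
# import the pmdarima submodules used below explicitly instead of `import *`
from pmdarima import arima, model_selection, preprocessing, utils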

In [None]:
rate.shape

In [None]:
# check normality with qqplot
_= qqplot(rate.grownRate, line='s')

In [None]:
utils.plot_acf(rate.grownRate, alpha=0.05)

In [None]:
utils.plot_pacf(rate.grownRate, alpha=0.05)

In [None]:
# split the series into a train set and a test set
r_train, r_test = model_selection.train_test_split(rate.grownRate, train_size=56)

In [None]:
# we decompose a train set to see trend, seasonal and residual
_=seasonal.seasonal_decompose(r_train.tolist(), model='additive',period=7).plot()

## Verify stationarity 

In [None]:
adf_test = arima.ADFTest()
pval, should_diff = adf_test.should_diff(r_train)
print('train set: p-value = {}, should_diff = {}'.format(pval, should_diff))

The r_train series is non-stationary because the p-value is greater than 0.05 (the ADF test cannot reject the null hypothesis of a unit root).
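Differencing is the usual remedy for this kind of non-stationarity. As a minimal illustration on a synthetic series (not the notebook's data), a first difference removes a linear trend entirely:

```python
import numpy as np

t = np.arange(10, dtype=float)
trended = 3.0 + 2.0 * t            # synthetic series with a linear trend
diff1 = np.diff(trended, n=1)      # first difference: x_t - x_{t-1}

print(diff1)                       # constant, so the trend is gone
```

The differencing order estimated in the next cells plays the same role for the growth rate series.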

In [None]:
# function for ndiff test
def ndiff_test(train):
    kpss_diffs = arima.ndiffs(train, test='kpss', max_d=6)
    adf_ndiffs = arima.ndiffs(train, test= 'adf', max_d=6 )
    
    return max(adf_ndiffs, kpss_diffs)

In [None]:
print('train set: Estimated differencing term: {}'.format(ndiff_test(r_train)) )

In [None]:
# estimate a seasonal differencing term D
D = arima.nsdiffs(r_train, m=7)
print('train set: seasonal differencing term: {}'.format(D))

**Train model**

In [None]:
from pmdarima.pipeline import Pipeline
import pmdarima as pm
arif = Pipeline([('boxcox', preprocessing.BoxCoxEndogTransformer(lmbda2=1e-6)), \
                 ('arima', pm.AutoARIMA(trace=True, suppress_warnings=True))])
arif.fit(r_train)

In [None]:
arif.summary()

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def forecast_one_step(modeled, n=1):
    fc = modeled.predict(n_periods=n, return_conf_int=False)
    #_, conf_init =  modeled.predict(n_periods=n, return_conf_int=True, inverse_transform=False)
    return fc.tolist()[0]#, np.asanyarray(conf_init).tolist()[0])

def update_model(test, models):
    forecasts = []
    #confidence_intervals = []
    for new_ob in test:
        fc = forecast_one_step(models)
        forecasts.append(fc)
       # confidence_intervals.append(conf)
    
        # updates the existing model 
        models.update(np.array([new_ob]))
        
    return forecasts#, confidence_intervals

In [None]:
fcast = update_model(r_test, arif)

In [None]:
print('RMSE: {}'.format(np.sqrt(mean_squared_error(r_test,fcast))))

In [None]:
resid = r_test - fcast

In [None]:
sns.distplot(resid)

In [None]:
plot_acf(resid, alpha=0.05)

The residuals show no significant autocorrelation, which is consistent with i.i.d. noise.

In [None]:
plt.figure(figsize=(10,5))
resid.plot()

## Viewing forecast

In [None]:
# function for plotting
def viewing_forecast(train, test, forecast, train_label, test_label, fc_label, title):
    
    plt.figure(figsize=(12, 6))
    fc_series =pd.Series(forecast, index=test.index)
    ax = train.plot(label=train_label)
    fc_series.plot(ax=ax, label=fc_label, alpha=0.7)
    test.plot(ax=ax, label=test_label, alpha=0.7, color='green')
    # mark the end of the training period
    ax.vlines(train.index.max(), train.min(), train.max(), linestyles='dashdot', colors='r',\
              label='end of train set')
    ax.set_xlabel('Date')
    ax.set_ylabel(' Growth rate')
    ax.set_title(title)
    plt.legend()  

In [None]:
viewing_forecast(r_train[22:], r_test, fcast, train_label='actual train', test_label='actual test', \
                 fc_label = 'prediction',\
                 title = 'Actual-Prediction plot')

What happens in the next days?

In [None]:
# to forecast the confirmed growth rate for the next day, we start from the predictions for the previous days
fcast # the predicted values

In [None]:
thr_date = end_date - timedelta(days=3)
# predictions for the last four test dates (fcast[i] is the forecast for r_test.index[i])
previous_data = pd.Series(fcast[-4:], index=r_test.index[r_test.index >= thr_date])
previous_data

In [None]:
fcast_data = update_model(r_test[3:] , arif)
print('forecast data: \n {}\n'.format(fcast_data)) 
fdate = pd.date_range(start=end_date+timedelta(days=1), periods=len(fcast_data))
print('Correspond to date:\n {}'.format(fdate))

In [None]:
# series
forecast_data = pd.Series(fcast_data, index=fdate)

# concatenate previous and forecast data into the evolution Series
evolution = pd.concat([previous_data, forecast_data], sort=False)

# check
evolution.head(12)

In [None]:
#plotting
ax = evolution[:5].plot(figsize=(15,5), color='red', label='predicted', legend=True)
r_test[4:].plot(ax=ax, label='actual', legend=True)
evolution[4:].plot(ax=ax, label='forecast',color='green', legend=True)
ax.vlines(r_test.index[-1], r_test.min(), r_test.max(), linestyles='dashdot', colors='black',\
              label='stop')
ax.set_ylabel(' Growth rate')
plt.title('Growth rate covid-19 evolution')
plt.legend(loc= 'best')  

# Model IV: Confirmed growth rate with Prophet

In [None]:
#importing package
from fbprophet import Prophet

In [None]:
rate.head()

In [None]:
# Prophet expects a DataFrame with a date column 'ds' and a value column 'y'
pdata = rate.grownRate

In [None]:
# rename the columns to the names Prophet expects and check the result
pdat = pdata.reset_index()
pdat = pdat.rename(columns={'ObservationDate':'ds', 'grownRate':'y'})
pdat.head()

In [None]:
m = Prophet(interval_width=0.95,changepoint_prior_scale=1.05)
m.fit(pdat)

In [None]:
# future days
futureDays = m.make_future_dataframe(periods=12)
futureDays.tail()

In [None]:
growth_rate_forecast = m.predict(futureDays)

In [None]:
growth_rate_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

In [None]:
#we plot graph
graph = m.plot(growth_rate_forecast)
plt.title('growth rate worldwide forecasting')

In [None]:
graph1 = m.plot_components(growth_rate_forecast)

From the components plot, I think that **period = 7 days**.

### Diagnostics for Model IV
We start with cross-validation:

In [None]:
from fbprophet.diagnostics import cross_validation
from fbprophet.diagnostics import performance_metrics

In [None]:
# for cross-validation we choose windows that fit within the range of our data
df_cv = cross_validation(m, initial='30 days', period='2 days', horizon = '14 days')
df_cv.head(3)

In [None]:
df_p = performance_metrics(df_cv)
df_p.head()

In [None]:
from fbprophet.plot import plot_cross_validation_metric
ufig = plot_cross_validation_metric(df_cv, metric='mape')

OK, we have found a good model (**MAPE between 2 and 3**) that fits our problem well. So we can now forecast the evolution of covid-19 worldwide.
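
When reading the MAPE plot, it may help to recall how the metric is defined. The sketch below follows the usual definition and returns a fraction (0.05 means 5%). To my knowledge Prophet's `performance_metrics` uses the same fractional convention, but treat that as an assumption and check the library documentation. The function `mape` and the sample values are illustrative only.

```python
import numpy as np

def mape(y, yhat):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    return np.mean(np.abs((y - yhat) / y))

# Illustrative values only, not the notebook's data.
y = [0.10, 0.20, 0.40]
yhat = [0.11, 0.18, 0.40]
print('MAPE: {:.4f}'.format(mape(y, yhat)))
```

Because the error is divided by `y`, the metric becomes unstable when the actual growth rate is close to zero.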

I think that in the coming days or month (April), China will be much less affected by this disease, because its situation keeps improving: many people infected with covid-19 are recovering. The opposite is true in the rest of the world.

The next part of this notebook compares the growth rate between China and the rest of the world (ROW) and, if possible, estimates the speed of contamination (confirmed cases).

# Comparison of Confirmed Growth Rate and Forecasting (China vs ROW)

In [None]:
# compute the day-over-day growth rate of a cumulative series (the first day is set to 0)
def grate(obs = None):
    x = []
    x.append(0)
    for i in range(obs.shape[0]-1):
        a = obs.iloc[i+1] - obs.iloc[i]
        x.append(a/obs.iloc[i])
        
    y = pd.DataFrame(x, columns=['growth_rate'],index=obs.index)
    
    return y.reset_index()
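
For comparison, the same quantity can be obtained with pandas' built-in `pct_change`. This is a sketch on made-up counts, not a replacement for `grate`. One difference to keep in mind: `fillna(0)` here also replaces any other missing value with 0, which the loop above does not do.

```python
import numpy as np
import pandas as pd

# Illustrative cumulative counts, not the notebook's data.
obs = pd.Series([100, 150, 300, 330],
                index=pd.date_range('2020-01-22', periods=4, name='ObservationDate'))

# Relative day-over-day change; the first day has no predecessor, so set it to 0.
growth = obs.pct_change().fillna(0).rename('growth_rate').reset_index()
print(growth)
```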

In [None]:
china_obs = china.groupby('ObservationDate')['Confirmed'].agg('sum')
row_obs = rest_world.groupby('ObservationDate')['Confirmed'].agg('sum')

In [None]:
import plotly.offline as py
import plotly.express as px
import cufflinks as cf
py.init_notebook_mode(connected=False)
cf.set_config_file(offline=True)

In [None]:
#plt.figure(figsize=(10,5))
#china_obs.plot()
#row_obs.plot()
#plt.ylabel('Cummulative Confirmed Cases')
#plt.title('China(Blue) vs ROW(Red) Covid 19 diseases')
udf = pd.DataFrame({'chinaConfirmedCases':china_obs, 'rowConfirmedCases':row_obs})
udf.iplot(title='Comparison between confirmed cases in China and ROW')

Something is happening in these curves. **Why do the China curve and the ROW curve show the same kind of break point?** On 2020-02-12 it looks as if a new ncovid-19 appears, and on 2020-02-13 many more people (15133) in China are reported as confirmed cases.

The ROW curve has the same shape as the China curve, shifted by 29 days (2020-02-12 to 2020-03-12). Again, on 2020-03-13 many more people (16589) in the rest of the world are reported as confirmed cases.

**From these curves, my question is: does the break in the curve on 2020-03-12 tell us that there is another ncovid-19 in the rest of the world, or that ncovid-19 mutated and became something else on that date?** Please tell me if I'm wrong!

In [None]:
# compute the growth rates for China and ROW

rate_china = grate(china_obs)
rate_row = grate(row_obs)

In [None]:
cr = pd.DataFrame({'China_growth_rate':rate_china.growth_rate, 'Row_growth_rate':rate_row.growth_rate})
cr.index = rate_china.ObservationDate
cr.head()

In [None]:
cr.iplot(title='Comparison growth rate between China and ROW')

# China growth rate forecast

In [None]:
prate_china = rate_china.rename(columns={'ObservationDate':'ds', 'growth_rate':'y'})

In [None]:
mc = Prophet(interval_width=0.95,changepoint_prior_scale=1.05)
mc.fit(prate_china)

In [None]:
cfutureDays = mc.make_future_dataframe(periods=10)
cfutureDays.tail()

In [None]:
growth_china = mc.predict(cfutureDays)

In [None]:
_=mc.plot(growth_china)

In [None]:
_= mc.plot_components(growth_china)

# ROW growth rate forecast

In [None]:
prate_row = rate_row.rename(columns={'ObservationDate':'ds', 'growth_rate':'y'})

In [None]:
mr = Prophet(interval_width=0.95, changepoint_prior_scale=4.05)
mr.fit(prate_row)

In [None]:
rfutureDays = mr.make_future_dataframe(periods=10)

In [None]:
growth_row = mr.predict(rfutureDays)

In [None]:
_= mr.plot(growth_row)

In [None]:
_= mr.plot_components(growth_row)

# Seasonality of Confirmed cases

A time series can be expressed as **ts = trend + seasonal + cyclical + irregular**. Here we study only the seasonal part of the confirmed cases feature. We proceed as follows:

1- visualize the weekly data

2- decompose the time series into trend, seasonal and irregular components

3- draw some conclusions

**We use the china_obs data for this work.**

**China**

In [None]:
china_obs.head()

In [None]:
# average the cumulative confirmed cases per week using the resample method
weekly_cases = china_obs.resample('W').aggregate([np.mean])

In [None]:
weekly_cases = weekly_cases.reset_index()

In [None]:
weekly_cases.iplot(x='ObservationDate', y='mean', title='Weekly Confirmed Cases Covid19 China',
                  xTitle='Date', yTitle='Average confirmed cases')

We can approximate the **average confirmed cases** with a logistic function.

Let ACC denote the weekly Average Confirmed Cases, given by the formula

> $ ACC(t_{weekly}) = \dfrac{L}{1+\exp\left(-k\,(t_{weekly}-t_{0,weekly})\right)} $

where

- $L$: the curve's maximum value,

- $k$: the logistic growth rate or steepness of the curve, and

- $t_{0,weekly}$: the $t_{weekly}$ value of the sigmoid's midpoint; time is counted in weeks.

The derivative is:
> $\dfrac{dACC}{dt} = k\:ACC\:(1 - \dfrac{ACC}{L})$

If we interpret ACC as the average number of people affected by covid-19 per week, then $L$ is the carrying capacity of the affected population (the maximum affected population size that a particular environment can sustain).
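
As a quick sanity check, the derivative formula can be compared with a numerical derivative of the logistic curve. The parameter values below (`L`, `k`, `t0`) are hypothetical and are not fitted to the China data.

```python
import numpy as np

def acc(t, L, k, t0):
    """Logistic curve with maximum L, steepness k and midpoint t0."""
    return L / (1.0 + np.exp(-k * (t - t0)))

# Hypothetical parameters, chosen only for illustration.
L, k, t0 = 80000.0, 1.2, 4.0

t = np.linspace(0, 10, 2001)            # time in weeks
y = acc(t, L, k, t0)

numeric = np.gradient(y, t)             # finite-difference derivative
analytic = k * y * (1 - y / L)          # k * ACC * (1 - ACC / L)

max_rel_err = np.max(np.abs(numeric - analytic)) / np.max(analytic)
print('max relative error: {:.2e}'.format(max_rel_err))
```

The growth is fastest at the midpoint $t_{0,weekly}$, where ACC equals $L/2$ and the derivative equals $kL/4$.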

In [None]:
decompose_weekly = seasonal.seasonal_decompose(x=china_obs, model='additive', period=7)
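
To make the decomposition step concrete, here is a rough sketch of an additive decomposition with period 7 done by hand: a centred moving average gives the trend, and the average of the detrended values at each position in the week gives the seasonal part. It runs on synthetic data and is a simplified illustration, not a reimplementation of statsmodels' `seasonal_decompose`.

```python
import numpy as np
import pandas as pd

period = 7
idx = pd.date_range('2020-01-22', periods=56, freq='D')

# Synthetic series: linear trend plus a weekly pattern that sums to zero.
weekly_pattern = np.array([3.0, 1.0, -2.0, -4.0, 0.0, 1.0, 1.0])
series = pd.Series(np.linspace(10, 120, 56) + np.tile(weekly_pattern, 8), index=idx)

# Trend: centred moving average over one full period (NaN at both ends).
trend = series.rolling(window=period, center=True).mean()

# Seasonal: mean of the detrended values at each position within the period.
detrended = series - trend
position = np.arange(len(series)) % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means = seasonal_means - seasonal_means.mean()   # centre around zero
seasonal = pd.Series(seasonal_means.values[position], index=idx)

# Irregular part: what is left after removing trend and seasonal components.
residual = series - trend - seasonal
print(seasonal_means.round(3))
```

On this synthetic input the recovered seasonal means match `weekly_pattern` and the residual is zero, because the data contain no noise. Real data will leave a non-trivial residual.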

In [None]:

decompose_weekly.seasonal.iplot()


**Period = 7 days. This period could be the time it takes for people affected by covid-19 to transmit the disease to other people, or the cycle over which covid-19 develops or mutates.**

In [None]:
# the transmission rate can be taken as an average confirmed-case rate
print('Transmission rate in china: {} cases per day'.format(800/7))

## Up next!

Source: https://www.kaggle.com/robikscube/tutorial-time-series-forecasting-with-xgboost