<a href="https://colab.research.google.com/github/JOSEPHREDDY07/Ineuron_ml_deployment_trng/blob/master/Covid_19_Predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**<span style="color: red;">I would like to thank all those who expressed their gratitude for my previous prediction which were close enough to the actual confirmed cases.</span>**

**<span style="color: red;">This is an updated version of my previous prediction. I have been receiving queries from people regarding my future predictions. So here it is for those who have been eagerly waiting for it.</span>**

**<span style="color: green;">In this new version I have also predicted the death and recovery rate.</span>**

# This notebook has been divided into three sections

### Section 1- Simple time series analysis of covid cases using general forecasting models and ARIMA model with taking any exogenous features into account.

### Section 2- Here the no of confirmed cases has been predicted taking into account the no of tests(extrapolating it five days down the line).

### Section 3- Here the average growth rate of covid cases,recovery rate and death rate has been analyzed and further time series analysis has been performed for prediction of deceased cases and recovered cases.

# Introduction

> This notebook predicts outcome of confirmed cases in India in the forthcoming week.

> Different forecasting methods such as
  
  
  -Simple Exponential Smoothing
  
  -Holt Winter's Method
  
  -SARIMA Model
 
 > have been used in predicting the outcomes.
 
> The data has been taken from www.covid19india.org

> This kernel is an updated version of my previous one

> I would like to thank those who expressed their gratitude by upvoting and people were also 

In [0]:
# Importing relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [0]:
import warnings
warnings.filterwarnings("ignore")

In [0]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [0]:
df=pd.read_csv('/kaggle/input/covid19updated/case_time_series.csv')
testing=pd.read_csv('/kaggle/input/updated-testing/tested_numbers_icmr_data.csv')

In [0]:
df.info()

In [0]:
testing.info()

In [0]:
df.head()

In [0]:
df.tail()

In [0]:
df['Date'] = df['Date'].str.replace(' ','-')
df['Date'] = df['Date'].str.replace('January','01')
df['Date'] = df['Date'].str.replace('February','02')
df['Date'] = df['Date'].str.replace('March','03')
df['Date'] = df['Date'].str.replace('April','04')
df['Date'] = df['Date'].str.replace('May','05')

In [0]:
df.tail()

In [0]:
df.loc[:,'Date'] = df.loc[:,'Date']+'2020'

In [0]:
df.tail()

In [0]:
df['Date']=pd.to_datetime(df['Date'],format='%d-%m-%Y')

In [0]:
df.head()

In [0]:
# Making the date column as index
df.index=df['Date']
df.drop(['Date'],axis=1,inplace=True)

In [0]:
df.head()

In [0]:
# Setting the frequency to Daily basis.
df=df.asfreq(freq='D')

In [0]:
# Plot of Daily Confirmed Cases in India
df['Daily Confirmed'].plot(figsize=(22,6),title='Daily Confirmed Cases');

## ETS Decomposition

In [0]:
from statsmodels.tsa.seasonal import seasonal_decompose

In [0]:
results = seasonal_decompose(df['Daily Confirmed'])

In [0]:
results.plot();

In [0]:
results.seasonal.plot(figsize=(20,10));

In [0]:
results.trend.plot(figsize=(20,10));

In [0]:
len(df)

### Splitting the data into training and testing set

In [0]:
train=df.iloc[:93]
test=df.iloc[93:]

As there were no reported cases in India from 4th February 2020 to 1st March 2020. So the curve isn't strictly increasing as a result we cannot use multiplicative trend or assume seasonality to be multiplicative so we have only used additive trend.

### Simple Exponential Smoothing

In [0]:
# Simple Exponential Smoothing

from statsmodels.tsa.holtwinters import SimpleExpSmoothing

span = 5
alpha = 2/(span+1)

df['EWMA5'] = df['Daily Confirmed'].ewm(alpha=alpha,adjust=False).mean()
df['SES5']=SimpleExpSmoothing(df['Daily Confirmed']).fit(smoothing_level=alpha,optimized=False).fittedvalues.shift(-1)
df.head()

In [0]:
df[['Daily Confirmed','EWMA5','SES5']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

### Double Exponential Smoothing

In [0]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Double Exponential Smoothing

df['DESadd5'] = ExponentialSmoothing(df['Daily Confirmed'], trend='add').fit().fittedvalues.shift(-1)
df.head()

In [0]:
df[['Daily Confirmed','EWMA5','DESadd5']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

Here we can see that Double Exponential Smoothing is a much better representation of the time series data than Simple Exponential Smoothing.<br>
Let's see if using a multiplicative trend adjustment helps.

___
## Triple Exponential Smoothing
Triple Exponential Smoothing, the method most closely associated with Holt-Winters, adds support for both trends and seasonality in the data. 


In [0]:
df['TESadd5'] = ExponentialSmoothing(df['Daily Confirmed'],trend='add',seasonal='add',seasonal_periods=5).fit().fittedvalues
df.head()

In [0]:
df[['Daily Confirmed','TESadd5']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

In [0]:
from statsmodels.tsa.stattools import adfuller

def adf_test(series,title=''):
    """
    Pass in a time series and an optional title, returns an ADF report
    """
    print(f'Augmented Dickey-Fuller Test: {title}')
    result = adfuller(series.dropna(),autolag='AIC') # .dropna() handles differenced data
    
    labels = ['ADF test statistic','p-value','# lags used','# observations']
    out = pd.Series(result[0:4],index=labels)

    for key,val in result[4].items():
        out[f'critical value ({key})']=val
        
    print(out.to_string())          # .to_string() removes the line "dtype: float64"
    
    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis")
        print("Reject the null hypothesis")
        print("Data has no unit root and is stationary")
    else:
        print("Weak evidence against the null hypothesis")
        print("Fail to reject the null hypothesis")
        print("Data has a unit root and is non-stationary")

In [0]:
adf_test(df['Daily Confirmed'])

In [0]:
from statsmodels.tsa.statespace.tools import diff
df['d1'] = diff(df['Daily Confirmed'],k_diff=1)

adf_test(df['d1'],'')

In [0]:
from statsmodels.tsa.statespace.tools import diff
df['d2'] = diff(df['Daily Confirmed'],k_diff=2)

adf_test(df['d2'],'')

In [0]:
pip install pyramid-arima

In [0]:
from pyramid.arima import auto_arima

In [0]:
train=df.iloc[:93]
test=df.iloc[93:]

In [0]:
# Auto Arima Model

stepwise_model = auto_arima(train['Daily Confirmed'], start_p=0, start_q=0, max_p=5, max_q=5,m=5,seasonality=True,
d=2,D=2,trace=True,error_action='ignore',suppress_warnings=True,stepwise=True)

print(stepwise_model.aic())

In [0]:
from statsmodels.tsa.statespace.sarimax import SARIMAX

In [0]:
model = SARIMAX(train['Daily Confirmed'],order=(3,2,2),seasonal_order=(0,2,2,5),enforce_invertibility=True)
results = model.fit()
results.summary()

In [0]:
start=len(train)
end=len(train)+len(test)-1
predictions = results.predict(start=start, end=end, dynamic=False,typ='levels').rename('SARIMA(2,2,1)(0,2,2,5) Predictions')

In [0]:
title='Covid-19 India Daily Confirmed Cases'
ylabel='Persons'

ax = test['Daily Confirmed'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True);
ax.autoscale(axis='x',tight=True);
ax.set(ylabel=ylabel);

In [0]:
from statsmodels.tools.eval_measures import rmse,meanabs

error = rmse(test['Daily Confirmed'], predictions)
print(f'SARIMAX(2,2,1)(0,2,2,5) RMSE Error: {error:11.10}')

In [0]:
model = SARIMAX(df['Daily Confirmed'],order=(3,2,2),seasonal_order=(0,2,2,5),enforce_invertibility=True)
results = model.fit()
fcast = results.predict(len(df),len(df)+5).rename('SARIMAX(3,2,2)(0,2,2,5) Forecast')

In [0]:
fcast

In [0]:
title='Confirmed patients for covid-19 in India'
ylabel='Patients'
ax = df['Daily Confirmed'].plot(legend=True,figsize=(12,6),title=title)
fcast.plot(legend=True);
ax.autoscale(axis='x',tight=True);
ax.set(ylabel=ylabel);

In [0]:
# Creating a new DataFrame for cumulative sum of confirmed cases in India.

date1 = '2020-01-31'
date2 = '2020-05-12'
mydates = pd.date_range(date1, date2).tolist()
len(mydates)

In [0]:
columns=['date','Patients','Total Confirmed']
final = pd.DataFrame(columns=columns)

In [0]:
final['date']=mydates

In [0]:
final.index=final['date']
final.drop(['date'],axis=1,inplace=True)

In [0]:
final['Patients']=df['Daily Confirmed']
final['Total Confirmed']=df['Total Confirmed']

In [0]:
final.tail(7)

In [0]:
final=final.reset_index()

In [0]:
for i in range(6):
    final.loc[97+i:,'Patients']=fcast[i]

In [0]:
final.info()

In [0]:
final=final.round()

In [0]:
final.info()

In [0]:
final.tail(10)

In [0]:
for i in range(6):
    final.loc[97+i,'Total Confirmed']=final.loc[97+i-1,'Total Confirmed']+final.loc[97+i,'Patients']

In [0]:
final.tail()

# Section 2- Taking no of tests into account.

In [0]:
testing.head()

In [0]:
testing['Update Time Stamp']=testing['Update Time Stamp'].str.replace('/','-')
testing['Update Time Stamp']=testing['Update Time Stamp'].str.replace(' ','')
testing.head()

In [0]:
testing1=testing[['Update Time Stamp','Total Tested']]
testing1=testing1.rename(columns={'Update Time Stamp':'Date'})
testing1['Date']=pd.to_datetime(testing1['Date'],format='%d-%m-%Y')
testing1.info()

In [0]:
testing1=testing1[1:]

In [0]:
testing1=testing1.reset_index()

In [0]:
testing1.drop(['index'],axis=1,inplace=True)

## Extrapolating the no of tests five days down the line.

In [0]:
testing1['dayofweek']=testing1['Date'].dt.dayofweek
testing1.head()

In [0]:
testing1.info()

In [0]:
train=testing1[:48]
test=testing1[48:]

### ETS decomposition of testing data.

In [0]:
testing1.index=testing1['Date']
testing1.drop(['Date'],axis=1,inplace=True)

In [0]:
results = seasonal_decompose(testing1['Total Tested'])

In [0]:
results.seasonal.plot(figsize=(15,8));

In [0]:
results.trend.plot();

Simple Exponential Smoothing

In [0]:
# Simple Exponential Smoothing

from statsmodels.tsa.holtwinters import SimpleExpSmoothing

span = 5
alpha = 2/(span+1)

testing1['EWMA5'] = testing1['Total Tested'].ewm(alpha=alpha,adjust=False).mean()
testing1['SES5']=SimpleExpSmoothing(testing1['Total Tested']).fit(smoothing_level=alpha,optimized=False).fittedvalues.shift(-1)
testing1.head()

In [0]:
testing1[['Total Tested','EWMA5','SES5']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

Double Exponential Smoothing

In [0]:
# Double Exponential Smoothing

testing1['DESadd5'] = ExponentialSmoothing(testing1['Total Tested'], trend='add').fit().fittedvalues.shift(-1)
testing1.head()

In [0]:
testing1[['Total Tested','EWMA5','DESadd5']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

Here we can see that Double Exponential Smoothing is a much better representation of the time series data than Simple Exponential Smoothing.
Let's see if using a multiplicative trend adjustment helps.

In [0]:
testing1['DESmul7'] = ExponentialSmoothing(testing1['Total Tested'], trend='mul').fit().fittedvalues.shift(-1)
testing1.head()
testing1[['Total Tested','DESmul7','DESadd5']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

## Triple Exponential Smoothing

Triple Exponential Smoothing, the method most closely associated with Holt-Winters, adds support for both trends and seasonality in the data.

In [0]:
testing1['TESadd7'] = ExponentialSmoothing(testing1['Total Tested'],trend='add',seasonal='add',seasonal_periods=7).fit().fittedvalues
testing1.head()
testing1[['Total Tested','TESadd7']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

In [0]:
testing1['TESmul7'] = ExponentialSmoothing(testing1['Total Tested'],trend='mul',seasonal='mul',seasonal_periods=7).fit().fittedvalues
testing1.head()
testing1[['Total Tested','TESadd7','TESmul7']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

In [0]:
testing1.info()

In [0]:
train=testing1.iloc[:46]
test=testing1.iloc[46:]

In [0]:
fitted_model = ExponentialSmoothing(train['Total Tested'],trend='add').fit()

In [0]:
test_predictions = fitted_model.forecast(7).rename('Forecast')

In [0]:
test['Total Tested'].plot(legend=True,label='TEST',figsize=(12,8))
test_predictions.plot(legend=True,label='PREDICTION')

In [0]:
# Checking the RMSE for Double exponential exponential smoothing.

from statsmodels.tools.eval_measures import rmse,meanabs

print(rmse(test['Total Tested'],test_predictions))

In [0]:
fitted_model = ExponentialSmoothing(train['Total Tested'],trend='mul').fit()
test_predictions = fitted_model.forecast(7).rename('Forecast')
test['Total Tested'].plot(legend=True,label='TEST',figsize=(12,8))
test_predictions.plot(legend=True,label='PREDICTION');

In [0]:
print(rmse(test['Total Tested'],test_predictions))

In [0]:
fitted_model = ExponentialSmoothing(train['Total Tested'],trend='add',seasonal='add',seasonal_periods=7).fit()
test_predictions = fitted_model.forecast(7).rename('Forecast')
test['Total Tested'].plot(legend=True,label='TEST',figsize=(12,8))
test_predictions.plot(legend=True,label='PREDICTION');

In [0]:
print(rmse(test['Total Tested'],test_predictions))

In [0]:
fitted_model = ExponentialSmoothing(train['Total Tested'],trend='mul',seasonal='mul',seasonal_periods=7).fit()
test_predictions = fitted_model.forecast(7).rename('Forecast')
test['Total Tested'].plot(legend=True,label='TEST',figsize=(12,8))
test_predictions.plot(legend=True,label='PREDICTION');

In [0]:
print(rmse(test['Total Tested'],test_predictions))

RMSE is too high so going for ARIMA Model.

In [0]:
# Checking for stationarity
adf_test(testing1['Total Tested'])

In [0]:
testing1['d1'] = diff(testing1['Total Tested'],k_diff=1)

adf_test(testing1['d1'],'')

In [0]:
stepwise_model = auto_arima(train['Total Tested'], start_p=0, start_q=0, max_p=5, max_q=5, m=7,start_P=0, seasonal=True,
d=1, D=1, trace=True,error_action='ignore',suppress_warnings=True,stepwise=True,exogenous=train[['dayofweek']])

print(stepwise_model.aic())

In [0]:
model = SARIMAX(train['Total Tested'],order=(1,1,1),seasonal_order=(0,1,2,7),exogenous=train[['dayofweek']],enforce_invertibility=True)
results = model.fit()
results.summary()

In [0]:
start=len(train)
end=len(train)+len(test)-1
predictions = results.predict(start=start, end=end, dynamic=False,typ='levels',exogenous=test[['dayofweek']]).rename('SARIMA(1,1,1)(0,1,2,7) Predictions')

In [0]:
title='Covid-19 India Daily Testing'
ylabel='Tests'

ax = test['Total Tested'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True);
ax.autoscale(axis='x',tight=True);
ax.set(ylabel=ylabel);

In [0]:
error = rmse(test['Total Tested'], predictions)
print(f'SARIMAX(1,1,1)(0,1,2,7) RMSE Error: {error:11.10}')

As ARIMA model has least RMSE so moving ahead with this model.

In [0]:
testing1.info()

In [0]:
model = SARIMAX(testing1['Total Tested'],order=(1,1,1),seasonal_order=(0,1,2,7),enforce_invertibility=True,exogenous=testing1[['dayofweek']])
results = model.fit()
fcast = results.predict(len(testing1),len(testing1)+7).rename('SARIMAX(1,1,1)(0,1,2,7) Forecast')

In [0]:
title='Number of testing for covid-19 in India'
ylabel='Tests'
ax = testing1['Total Tested'].plot(legend=True,figsize=(12,6),title=title)
fcast.plot(legend=True);
ax.autoscale(axis='x',tight=True);
ax.set(ylabel=ylabel);

In [0]:
date1 = '2020-03-18'
date2 = '2020-05-17'
mydates = pd.date_range(date1, date2).tolist()
len(mydates)

In [0]:
columns=['Date','Tests','dayofweek']
tests = pd.DataFrame(columns=columns)

In [0]:
tests['Date']=mydates

In [0]:
tests.info()

In [0]:
tests.index=tests['Date']
tests.drop(['Date'],axis=1,inplace=True)
tests['Tests']=testing1['Total Tested']

In [0]:
tests['dayofweek']=tests.index.dayofweek

In [0]:
tests.info()

In [0]:
tests=tests.reset_index()
tests.loc[53:,'Tests']=fcast

In [0]:
tests.index=tests['Date']
tests.drop(['Date'],axis=1,inplace=True)

In [0]:
tests.loc[53:,'Tests']=fcast

In [0]:
tests=tests.round()

In [0]:
tests.tail()

### Now as we have our estimated tests a week ahead let's predict the total confirmed cases a week ahead taking the number of tests into account.

Merging the tests dataframe and df dataframe for getting no of tests done.

In [0]:
df1=df['2020-03-18':]
df1=df1.reset_index()
df1.head()

In [0]:
covidtest=tests[:'2020-05-06']
exog_test=tests['2020-05-07':'2020-05-12']
covidtest=covidtest.reset_index()
exog_test=exog_test.reset_index()

In [0]:
covidtest.tail()

In [0]:
df_clean=pd.merge(df1,covidtest,on='Date',how='inner')
df_clean.head()

In [0]:
df_clean.info()

In [0]:
train=df_clean[:45]
test=df_clean[45:]

In [0]:
adf_test(df_clean['Daily Confirmed'])

In [0]:
df_clean['d1'] = diff(df_clean['Daily Confirmed'],k_diff=1)


adf_test(df_clean['d1'],'')

In [0]:
# Auto Arima Model

stepwise_model = auto_arima(train['Daily Confirmed'], start_p=0, start_q=0, max_p=5, max_q=5, m=5,start_P=0, seasonal=True,
d=1, D=1, trace=True,error_action='ignore',suppress_warnings=True,stepwise=True,exogenous=train[['Tests','dayofweek','Daily Recovered'
                                                                                                 ,'Daily Deceased']])

print(stepwise_model.aic())

In [0]:
model = SARIMAX(train['Daily Confirmed'],order=(2,1,0),seasonal_order=(0,1,1,5),enforce_invertibility=True,
                exogenous=train[['Daily Recovered','Daily Deceased','Tests','dayofweek']])
results = model.fit()
results.summary()

In [0]:
start=len(train)
end=len(train)+len(test)-1
predictions = results.predict(start=start, end=end, dynamic=False,typ='levels',
                              exogenous=test[['Daily Recovered','Daily Deceased','Tests','dayofweek']]).rename('SARIMA(2,1,0)(0,1,1,5) Predictions')

In [0]:
title='Covid-19 India Daily Confirmed Cases'
ylabel='Persons'

ax = test['Daily Confirmed'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True);
ax.autoscale(axis='x',tight=True);
ax.set(ylabel=ylabel);

In [0]:
from statsmodels.tools.eval_measures import rmse,meanabs

error = rmse(test['Daily Confirmed'], predictions)
print(f'SARIMAX(2,1,0)(0,1,1,5) RMSE Error: {error:11.10}')

In [0]:
model = SARIMAX(df_clean['Daily Confirmed'],order=(2,1,0),seasonal_order=(0,1,1,5),enforce_invertibility=True,
                exogenous=df_clean[['Daily Recovered','Daily Deceased','Tests','dayofweek']])
results = model.fit()
results.summary()

In [0]:
start=len(df_clean)
end=len(df_clean)+5
predictions = results.predict(start=start, end=end, dynamic=False,typ='levels',exogenous=exog_test[['Tests','dayofweek']]).rename('SARIMA(2,1,0)(0,1,1,5) Predictions')

In [0]:
title='Covid-19 India Daily Confirmed Cases'
ylabel='Persons'

ax = df_clean['Daily Confirmed'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True);
ax.autoscale(axis='x',tight=True);
ax.set(ylabel=ylabel);

In [0]:
predictions

In [0]:
# Creating a new DataFrame for cumulative sum of confirmed cases in India.

date1 = '2020-03-18'
date2 = '2020-05-11'
mydates = pd.date_range(date1, date2).tolist()
len(mydates)

In [0]:
columns=['date','Patients','Total Confirmed','Tests']
final = pd.DataFrame(columns=columns)

In [0]:
final['date']=mydates
final['Patients']=df_clean['Daily Confirmed']
final['Total Confirmed']=df_clean['Total Confirmed']

In [0]:
final['Tests']=df_clean['Tests']

In [0]:
final.info()

In [0]:
final.loc[50:,'Patients']=predictions

In [0]:
for i in range(5):
    final.loc[50+i,'Tests']=exog_test.loc[i,'Tests']

In [0]:
final.info()

In [0]:
final=final.round()

In [0]:
final.info()

In [0]:
for i in range(5):
    final.loc[50+i,'Total Confirmed']=final.loc[50+i-1,'Total Confirmed']+final.loc[50+i,'Patients']

In [0]:
final.index=final['date']
final.drop(['date'],axis=1,inplace=True)

In [0]:
final.tail()

In [0]:
final.to_csv('output.csv')

## Predictions using Regression Models taking testing into account.

In [0]:
df_clean.info()

In [0]:
train=df_clean[:45]
test=df_clean[45:]

In [0]:
train['day'] = train['Date'].dt.day
train['month'] = train['Date'].dt.month
train['dayofyear'] = train['Date'].dt.dayofyear
train['quarter'] = train['Date'].dt.quarter
train['weekofyear'] = train['Date'].dt.weekofyear

In [0]:
train.columns

In [0]:
columns=['Daily Recovered','Daily Deceased', 'Tests','day', 'month', 'dayofweek',
         'dayofyear', 'quarter', 'weekofyear']
y=train['Daily Confirmed']

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
train=train[columns]
x_train,x_test,y_train,y_test=train_test_split(train,y,test_size=0.2,random_state=0)

In [0]:
models = []
mse = []
mae = []
rmse = []

## Random Forest Regressor

In [0]:
from sklearn.ensemble import RandomForestRegressor
reg=RandomForestRegressor(n_estimators=500,random_state=1)
reg.fit(x_train,y_train)

In [0]:
pred_RF=reg.predict(x_test)

In [0]:
# Importing the error metric
from sklearn.metrics import mean_squared_error,mean_absolute_error

In [0]:
models.append('Random Forest')
mse.append(round(mean_squared_error(pred_RF, y_test),2))
mae.append(round(mean_absolute_error(pred_RF, y_test),2))
rmse.append(round(np.sqrt(mean_squared_error(pred_RF, y_test)),2))

## XGB Regressor

In [0]:
from sklearn.ensemble import GradientBoostingRegressor

In [0]:
# Training the algorithm
fit_GB = GradientBoostingRegressor(n_estimators=200)
fit_GB.fit(x_train, y_train)

In [0]:
pred_XGB=fit_GB.predict(x_test)

In [0]:
models.append('XGBoost')
mse.append(round(mean_squared_error(pred_XGB, y_test),2))
mae.append(round(mean_absolute_error(pred_XGB, y_test),2))
rmse.append(round(np.sqrt(mean_squared_error(pred_XGB, y_test)),2))

## LGBM Regressor

In [0]:
from lightgbm import LGBMRegressor

In [0]:
lgbm = LGBMRegressor(n_estimators=1300)
lgbm.fit(x_train,y_train)
pred_LGBM = lgbm.predict(x_test)

In [0]:
models.append('LGBM')
mse.append(round(mean_squared_error(pred_LGBM, y_test),2))
mae.append(round(mean_absolute_error(pred_LGBM, y_test),2))
rmse.append(round(np.sqrt(mean_squared_error(pred_LGBM, y_test)),2))

In [0]:
import seaborn as sb

In [0]:
plt.figure(figsize= (15,10))
plt.xticks(rotation = 90 ,fontsize = 11)
plt.yticks(fontsize = 10)
plt.xlabel("Different Models",fontsize = 20)
plt.ylabel('RMSE',fontsize = 20)
plt.title("RMSE Values of different models" , fontsize = 20)
sb.barplot(x=models,y=rmse);

# Section 3- Analysis of Death and Recovery rate.

## Let's have a look over the Average Growth rate of covid cases in India
 
 -Before first lockdown (upto 24 March 2020)
 
 -During first lockdown (from 25th March 2020 to 14th April 2020)
 
 -During second lockdown (from 15th April 2020 to 3rd May 2020)
 
 -In third lockdown (from 4th May 2020 onwards)
 
 

In [0]:
before_lockdown_growth = []
first_lockdown_growth = []
second_lockdown_growth = []
third_lockdown_growth = []

# As there the continuous reporting of cases have started from 2/3/2020 so truncating the dataframe accordingly.

Before_lockdown=df['2020-03-02':'2020-03-25']
Before_lockdown=Before_lockdown.reset_index()

# Calculating average growth rate before lockdown period

for i in range(1,len(Before_lockdown)):
    before_lockdown_growth.append(Before_lockdown.loc[i,'Daily Confirmed'] / Before_lockdown.loc[i-1,'Daily Confirmed'])


first_lockdown=df['2020-03-25':'2020-04-15']
first_lockdown=first_lockdown.reset_index()

# Calculating average growth rate in first lockdown

for i in range(1,len(first_lockdown)):
    first_lockdown_growth.append(first_lockdown.loc[i,'Daily Confirmed'] / first_lockdown.loc[i-1,'Daily Confirmed'])
    

second_lockdown=df['2020-04-15':'2020-05-04']
second_lockdown=second_lockdown.reset_index()

# Calculating average growth rate in second lockdown

for i in range(1,len(second_lockdown)):
    second_lockdown_growth.append(second_lockdown.loc[i,'Daily Confirmed'] / second_lockdown.loc[i-1,'Daily Confirmed'])


third_lockdown=df['2020-05-04':]
third_lockdown=third_lockdown.reset_index()

# Calculating average growth rate in third lockdown

for i in range(1,len(third_lockdown)):
    third_lockdown_growth.append(third_lockdown.loc[i,'Daily Confirmed'] / third_lockdown.loc[i-1,'Daily Confirmed'])



before_lockdown_growth_factor = sum(before_lockdown_growth)/len(before_lockdown_growth)
first_lockdown_growth_factor = sum(first_lockdown_growth)/len(first_lockdown_growth)
second_lockdown_growth_factor = sum(second_lockdown_growth)/len(second_lockdown_growth)
third_lockdown_growth_factor = sum(third_lockdown_growth)/len(third_lockdown_growth)

print('Average growth factor before lockdown implemented ',before_lockdown_growth_factor)
print('Average growth factor in first lockdown ',first_lockdown_growth_factor)
print('Average growth factor in second lockdown ',second_lockdown_growth_factor)
print('Average growth factor in third lockdown ',third_lockdown_growth_factor)

# Prediction using average growth factor over the entire period 
### Assuming the same growth factor continues for the next 15 days

In [0]:
growth_diff = []

df1=df['2020-03-02':]
df1=df1.reset_index()

for i in range(1,len(df1)):
    growth_diff.append(df1.loc[i,'Daily Confirmed'] / df1.loc[i-1,'Daily Confirmed'])

growth_factor = sum(growth_diff)/len(growth_diff)
print('Average growth factor',growth_factor)

In [0]:
date1 = '2020-01-30'
date2 = '2020-05-21'
prediction_dates = pd.date_range(date1, date2).tolist()

In [0]:
columns=['date','Patients','Total Confirmed']
confirmed = pd.DataFrame(columns=columns)
confirmed['date']=prediction_dates

In [0]:
confirmed.index=confirmed['date']
confirmed.drop(['date'],axis=1,inplace=True)

In [0]:
confirmed['Patients']=df['Daily Confirmed']
confirmed['Total Confirmed']=df['Total Confirmed']

In [0]:
previous_day_cases=df.loc['2020-05-06','Daily Confirmed']
predicted_cases = []

for i in range(15):
    predicted_value = previous_day_cases *  growth_factor
    predicted_cases.append(predicted_value)
    previous_day_cases = predicted_value

In [0]:
confirmed=confirmed.reset_index()

In [0]:
confirmed.loc[98:,'Patients']=predicted_cases

In [0]:
confirmed.info()

In [0]:
for i in range(15):
    confirmed.loc[98+i,'Total Confirmed']=confirmed.loc[98+i-1,'Total Confirmed']+confirmed.loc[98+i,'Patients']

In [0]:
confirmed.index=confirmed['date']
confirmed.drop(['date'],axis=1,inplace=True)

In [0]:
title='Covid-19 India Total Confirmed Cases'
ylabel='Persons'

ax = confirmed['Total Confirmed'].iloc[-5:].plot(legend=True,figsize=(20,6),title=title,linestyle='-',color='c')
ax.autoscale(axis='x');
ax.set(ylabel=ylabel);

In [0]:
confirmed=confirmed.round()
confirmed.tail()

We could see that the graph is increasing exponentialy if the average growth factor doesn't decrease. It is important that the growth factor is reduced to flatten the curve.

## Let's have a look at the average recovery rate.

In [0]:
df.head()

In [0]:
recovery_diff = []

df1=df['2020-03-23':]
df1=df1.reset_index()

for i in range(1,len(df1)):
    recovery_diff.append(df1.loc[i,'Daily Recovered'] / df1.loc[i-1,'Daily Recovered'])

recovery_factor = sum(recovery_diff)/len(recovery_diff)
print('Average recovery factor',recovery_factor)

In [0]:
confirmed['Daily Recovered']=df['Daily Recovered']

In [0]:
confirmed['Total Recovered']=df['Total Recovered']
confirmed['Daily Deceased']=df['Daily Deceased']
confirmed['Total Deceased']=df['Total Deceased']

In [0]:
confirmed.info()

In [0]:
previous_day_recovery=df.loc['2020-05-06','Daily Recovered']
predicted_recovery = []

for i in range(15):
    predicted_value = previous_day_recovery *  recovery_factor
    predicted_recovery.append(predicted_value)
    previous_day_cases = predicted_value
    
confirmed=confirmed.reset_index()
confirmed.loc[98:,'Daily Recovered']=predicted_recovery

In [0]:
for i in range(15):
    confirmed.loc[98+i,'Total Recovered']=confirmed.loc[98+i-1,'Total Recovered']+confirmed.loc[98+i,'Daily Recovered']

In [0]:
confirmed.tail()

# ETS Decomposition for recovered cases

In [0]:
results = seasonal_decompose(df['Daily Recovered'])
results.plot();

In [0]:
results.seasonal.plot(figsize=(20,10));

In [0]:
# Trend of Daily recovered cases
results.trend.plot(figsize=(20,10));

In [0]:
# As the recovery of cases started from 23/03/2020
df1=df['2020-03-23':]
len(df1)

In [0]:
train=df1.iloc[:40]
test=df1.iloc[40:]

In [0]:
# Simple Exponential Smoothing

from statsmodels.tsa.holtwinters import SimpleExpSmoothing

span = 5
alpha = 2/(span+1)

df1['EWMA5'] = df1['Daily Recovered'].ewm(alpha=alpha,adjust=False).mean()
df1['SES5']=SimpleExpSmoothing(df1['Daily Recovered']).fit(smoothing_level=alpha,optimized=False).fittedvalues.shift(-1)
df1.head()

In [0]:
df1[['Daily Recovered','EWMA5','SES5']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

Double Exponential Smoothing

In [0]:
# Double Exponential Smoothing

df1['DESadd5_recovery'] = ExponentialSmoothing(df1['Daily Recovered'], trend='add').fit().fittedvalues.shift(-1)
df1.head()

In [0]:
df1[['Daily Recovered','EWMA5','DESadd5_recovery']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

Here we can see that Double Exponential Smoothing is a much better representation of the time series data than Simple Exponential Smoothing.
Let's see if using a multiplicative trend adjustment helps.

In [0]:
df1['DESmul5_recovery'] = ExponentialSmoothing(df1['Daily Recovered'], trend='mul').fit().fittedvalues.shift(-1)
df1[['Daily Recovered','DESadd5_recovery','DESmul5_recovery']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

## Triple Exponential Smoothing

Triple Exponential Smoothing, the method most closely associated with Holt-Winters, adds support for both trends and seasonality in the data.

In [0]:
df1['TESadd5'] = ExponentialSmoothing(df1['Daily Recovered'],trend='add',seasonal='add',seasonal_periods=5).fit().fittedvalues
df1.head()

In [0]:
df1[['Daily Recovered','TESadd5']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

In [0]:
df1['TESmul5'] = ExponentialSmoothing(df1['Daily Recovered'],trend='mul',seasonal='mul',seasonal_periods=5).fit().fittedvalues
df1[['Daily Recovered','TESadd5','TESmul5']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

Testing for stationarity of recovery data.

In [0]:
adf_test(df1['Daily Recovered'])

In [0]:
df1['d1_recovery'] = diff(df1['Daily Recovered'],k_diff=1)

adf_test(df1['d1_recovery'],'')

In [0]:
# Auto Arima Model

stepwise_model = auto_arima(train['Daily Recovered'], start_p=0, start_q=0, max_p=5, max_q=5,m=7,seasonality=True,
d=1,D=1,trace=True,error_action='ignore',suppress_warnings=True,stepwise=True)

print(stepwise_model.aic())

In [0]:
model = SARIMAX(train['Daily Recovered'],order=(2,1,1),seasonal_order=(0,1,2,7),enforce_invertibility=True)
results = model.fit()
results.summary()

In [0]:
start=len(train)
end=len(train)+len(test)-1
predictions = results.predict(start=start, end=end, dynamic=False,typ='levels').rename('SARIMA(2,1,1)(0,1,2,7) Predictions')

In [0]:
title='Covid-19 India Daily Recovered Cases'
ylabel='Persons'

ax = test['Daily Recovered'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True);
ax.autoscale(axis='x',tight=True);
ax.set(ylabel=ylabel);

In [0]:
predictions

In [0]:
error = rmse(test['Daily Recovered'], predictions)
print(f'SARIMAX(2,1,1)(0,1,2,7) RMSE Error: {error:11.10}')

Predicting for one week ahead.

In [0]:
model = SARIMAX(df1['Daily Recovered'],order=(2,1,1),seasonal_order=(0,1,2,7),enforce_invertibility=True)
results = model.fit()
results.summary()

In [0]:
start=len(df1)
end=len(df1)+5
predictions = results.predict(start=start, end=end, dynamic=False,typ='levels').rename('SARIMA(2,1,1)(0,1,2,7) Predictions')

In [0]:
title='Covid-19 India Daily Recovered Cases'
ylabel='Persons'

ax = df1['Daily Recovered'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True);
ax.autoscale(axis='x',tight=True);
ax.set(ylabel=ylabel);

In [0]:
predictions

In [0]:
date1 = '2020-01-30'
date2 = '2020-05-12'
mydates = pd.date_range(date1, date2).tolist()
len(mydates)

In [0]:
df.info()

In [0]:
a=pd.DataFrame()
a['Date']=mydates
a.info()

In [0]:
a.index=a['Date']
a.drop(['Date'],axis=1,inplace=True)
a.info()

In [0]:
a['Recovered']=df['Daily Recovered']
a['Total Recovered']=df['Total Recovered']
a.info()

In [0]:
a=a.round()

In [0]:
a.loc[98:,'Recovered']=predictions

In [0]:
a.info()

In [0]:
a=a.reset_index()
for i in range(6):
    a.loc[98+i,'Total Recovered']=a.loc[98+i-1,'Total Recovered']+a.loc[98+i,'Recovered']
a.info()    

In [0]:
a.tail()

# From ARIMA model total recovered cases would be 23768 approximately by 12/05/2020.

## Let's have a look at the Death rate.

In [0]:
df.tail()

In [0]:
death_diff = []

df1=df['2020-03-22':]
df1=df1.reset_index()

for i in range(1,len(df1)):
    death_diff.append(df1.loc[i,'Daily Deceased'] / df1.loc[i-1,'Daily Deceased'])

death_factor = sum(death_diff)/len(death_diff)
print('Average death factor',death_factor)

In [0]:
previous_day_death=df.loc['2020-05-06','Daily Deceased']
predicted_death = []

for i in range(15):
    predicted_value = previous_day_death *  death_factor
    predicted_death.append(predicted_value)
    previous_day_death = predicted_value
    

confirmed.loc[98:,'Daily Deceased']=predicted_death

In [0]:
confirmed.tail()

In [0]:
for i in range(15):
    confirmed.loc[98+i,'Total Deceased']=confirmed.loc[98+i-1,'Total Deceased']+confirmed.loc[98+i,'Daily Deceased']

In [0]:
confirmed.index=confirmed['date']
confirmed.drop(['date'],axis=1,inplace=True)

In [0]:
title='Covid-19 India'
ylabel='Persons'

ax = confirmed[['Total Recovered','Total Deceased']].iloc[-40:].plot(legend=True,figsize=(20,6),title=title,linestyle='-')
ax.autoscale(axis='x');
ax.set(ylabel=ylabel);

In [0]:
confirmed=confirmed.round()
confirmed.tail()

## If similar rate of death and recovery continues then by 21/05/2020

### Total death toll in India will reach 15000 mark

### Recovered patients would be 40000.

# ETS Decomposition for death rate.

In [0]:
df.info()

In [0]:
results = seasonal_decompose(df['Daily Deceased'])
results.plot();

In [0]:
results.seasonal.plot(figsize=(20,10));

In [0]:
results.trend.plot(figsize=(20,10));

In [0]:
# As the deceased cases started from 22/03/2020
df1=df['2020-03-22':]
len(df1)

In [0]:
train=df1.iloc[:41]
test=df1.iloc[41:]

Simple Exponential Smoothing

In [0]:
span = 7
alpha = 2/(span+1)

df1['EWMA7_deceased'] = df1['Daily Deceased'].ewm(alpha=alpha,adjust=False).mean()
df1['SES7_deceased']=SimpleExpSmoothing(df1['Daily Deceased']).fit(smoothing_level=alpha,optimized=False).fittedvalues.shift(-1)
df1.head()

In [0]:
df1[['Daily Deceased','EWMA7_deceased','SES7_deceased']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

Double Exponential Smoothing

In [0]:
# Double Exponential Smoothing

df1['DESadd7_deceased'] = ExponentialSmoothing(df1['Daily Deceased'], trend='add').fit().fittedvalues.shift(-1)
df1.head()

In [0]:
df1[['Daily Deceased','EWMA7_deceased','DESadd7_deceased']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

Here we can see that Double Exponential Smoothing is a much better representation of the time series data than Simple Exponential Smoothing. Let's see if using a multiplicative trend adjustment helps.

In [0]:
df1['DESmul7_deceased'] = ExponentialSmoothing(df1['Daily Deceased'], trend='mul').fit().fittedvalues.shift(-1)
df1[['Daily Deceased','DESadd7_deceased','DESmul7_deceased']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

## Triple Exponential Smoothing

Triple Exponential Smoothing, the method most closely associated with Holt-Winters, adds support for both trends and seasonality in the data.

In [0]:
df1['TESadd7_deceased'] = ExponentialSmoothing(df1['Daily Deceased'],trend='add',seasonal='add',seasonal_periods=7).fit().fittedvalues
df1.head()

In [0]:
df1[['Daily Deceased','TESadd7_deceased']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

In [0]:
df1['TESmul7_deceased'] = ExponentialSmoothing(df1['Daily Deceased'],trend='mul',seasonal='mul',seasonal_periods=7).fit().fittedvalues
df1[['Daily Deceased','TESadd7_deceased','TESmul7_deceased']].iloc[-14:].plot(figsize=(12,6)).autoscale(axis='x',tight=True);

Testing for stationarity of deceased data.

In [0]:
adf_test(df1['Daily Deceased'])

In [0]:
df1['d1_deceased'] = diff(df1['Daily Deceased'],k_diff=1)

adf_test(df1['d1_deceased'],'')

In [0]:
# Auto Arima Model

stepwise_model = auto_arima(train['Daily Deceased'], start_p=0, start_q=0, max_p=5, max_q=5,m=7,seasonality=True,
d=1,D=1,trace=True,error_action='ignore',suppress_warnings=True,stepwise=True)

print(stepwise_model.aic())

In [0]:
model = SARIMAX(train['Daily Deceased'],order=(0,1,2),seasonal_order=(0,1,1,7),enforce_invertibility=True)
results = model.fit()
results.summary()

In [0]:
start=len(train)
end=len(train)+len(test)-1
predictions = results.predict(start=start, end=end, dynamic=False,typ='levels').rename('SARIMA(0,1,2)(0,1,1,7) Predictions')

In [0]:
title='Covid-19 India Daily Deceased Cases'
ylabel='Persons'

ax = test['Daily Deceased'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True);
ax.autoscale(axis='x',tight=True);
ax.set(ylabel=ylabel);

In [0]:
error = rmse(test['Daily Deceased'], predictions)
print(f'SARIMAX(0,1,2)(0,1,1,7) RMSE Error: {error:11.10}')

Predicting for one week ahead.

In [0]:
model = SARIMAX(df1['Daily Deceased'],order=(0,1,2),seasonal_order=(0,1,1,7),enforce_invertibility=True)
results = model.fit()
results.summary()

In [0]:
start=len(df1)
end=len(df1)+5
predictions = results.predict(start=start, end=end, dynamic=False,typ='levels').rename('SARIMA(0,1,2)(0,1,1,7) Predictions')

In [0]:
predictions

In [0]:
title='Covid-19 India Daily Recovered Cases'
ylabel='Persons'

ax = df1['Daily Deceased'].plot(legend=True,figsize=(12,6),title=title)
predictions.plot(legend=True);
ax.autoscale(axis='x',tight=True);
ax.set(ylabel=ylabel);

In [0]:
date1 = '2020-01-30'
date2 = '2020-05-12'
mydates = pd.date_range(date1, date2).tolist()
len(mydates)

In [0]:
df.info()

In [0]:
a=pd.DataFrame()
a['Date']=mydates
a.info()

In [0]:
a.index=a['Date']
a.drop(['Date'],axis=1,inplace=True)

a['Deceased']=df['Daily Deceased']
a['Total Deceased']=df['Total Deceased']
a.info()

In [0]:
a=a.round()

a.loc[98:,'Deceased']=predictions

a.info()

In [0]:
a=a.reset_index()
for i in range(6):
    a.loc[98+i,'Total Deceased']=a.loc[98+i-1,'Total Deceased']+a.loc[98+i,'Deceased']
a.info()

In [0]:
a.tail()

# From ARIMA model total recovered cases would be 2546 approximately by 12/05/2020.