# COVID-19 - Prediction using VAR model

**The objective of this notebook is to build a time-series model using VAR that predicts the confirmed cases, samples tested and positive cases for each state over the upcoming days.**

In [17]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

In [18]:
df = pd.read_csv('covid_19_india.csv')
df.head()

Unnamed: 0,Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,ConfirmedForeignNational,Cured,Deaths,Confirmed
0,1,30/01/20,6:00 PM,Kerala,1,0,0,0,1
1,2,31/01/20,6:00 PM,Kerala,1,0,0,0,1
2,3,01/02/20,6:00 PM,Kerala,2,0,0,0,2
3,4,02/02/20,6:00 PM,Kerala,3,0,0,0,3
4,5,03/02/20,6:00 PM,Kerala,3,0,0,0,3


## Data Wrangling

In [19]:
df = df.dropna()

In [20]:
df = df.replace({'Telangana':'Telengana'})

Convert the date and time columns to standard formats so that the dataset can be treated as time-series data.

In [21]:
df['Date'] = pd.to_datetime(df.Date,dayfirst=True).dt.strftime('%Y-%m-%d')
df['Time'] = pd.to_datetime(df['Time']).dt.strftime('%H:%M:%S')
df.tail()

Unnamed: 0,Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,ConfirmedForeignNational,Cured,Deaths,Confirmed
4666,4667,2020-07-30,08:00:00,Telengana,-,-,43751,492,58906
4667,4668,2020-07-30,08:00:00,Tripura,-,-,2678,21,4485
4668,4669,2020-07-30,08:00:00,Uttarakhand,-,-,3811,72,6866
4669,4670,2020-07-30,08:00:00,Uttar Pradesh,-,-,45807,1530,77334
4670,4671,2020-07-30,08:00:00,West Bengal,-,-,44116,1490,65258


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4671 entries, 0 to 4670
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Sno                       4671 non-null   int64 
 1   Date                      4671 non-null   object
 2   Time                      4671 non-null   object
 3   State/UnionTerritory      4671 non-null   object
 4   ConfirmedIndianNational   4671 non-null   object
 5   ConfirmedForeignNational  4671 non-null   object
 6   Cured                     4671 non-null   int64 
 7   Deaths                    4671 non-null   int64 
 8   Confirmed                 4671 non-null   int64 
dtypes: int64(4), object(5)
memory usage: 364.9+ KB


In [23]:
df = df.loc[df['State/UnionTerritory'] != 'Unassigned']
df = df.loc[df['State/UnionTerritory'] != 'Cases being reassigned to states']

## Numerical variable dependency

Based on the graph, the number of people cured in a state grows faster than the number of deaths. Likewise, confirmed cases grow faster than the number of people cured.

Since the most recent counts are recorded against the latest date, the last record for each state is used for visualisation.

In [24]:
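df_u = df['State/UnionTerritory'].unique()
latest_rows = []
for s in df_u:
    l = df.loc[df['State/UnionTerritory'] == s]
    # The row with the highest serial number is the most recent record for the state
    latest_rows.append(l.loc[[l['Sno'].idxmax()]])
# DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once
df3 = pd.concat(latest_rows)
df3.head()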

Unnamed: 0,Sno,Date,Time,State/UnionTerritory,ConfirmedIndianNational,ConfirmedForeignNational,Cured,Deaths,Confirmed
4652,4653,2020-07-30,08:00:00,Kerala,-,-,11365,68,21797
4666,4667,2020-07-30,08:00:00,Telengana,-,-,43751,492,58906
4644,4645,2020-07-30,08:00:00,Delhi,-,-,118633,3907,133310
4663,4664,2020-07-30,08:00:00,Rajasthan,-,-,27569,650,38964
4669,4670,2020-07-30,08:00:00,Uttar Pradesh,-,-,45807,1530,77334


In [25]:
df3 = df3.reset_index()
df3 = df3.drop(columns=['index','Sno'])
df3 = df3[['State/UnionTerritory','Confirmed','Deaths','Cured']]
df3 = df3.loc[df3['State/UnionTerritory'] != 'Unassigned']
df3 = df3.loc[df3['State/UnionTerritory'] != 'Cases being reassigned to states']
df3.head()

Unnamed: 0,State/UnionTerritory,Confirmed,Deaths,Cured
0,Kerala,21797,68,11365
1,Telengana,58906,492,43751
2,Delhi,133310,3907,118633
3,Rajasthan,38964,650,27569
4,Uttar Pradesh,77334,1530,45807


Calculation of the total cases in each state

In [26]:
df3['Total Cases'] = (df3['Confirmed'] + df3['Deaths'] + df3['Cured']).astype(int)
df3.head()


Unnamed: 0,State/UnionTerritory,Confirmed,Deaths,Cured,Total Cases
0,Kerala,21797,68,11365,33230
1,Telengana,58906,492,43751,103149
2,Delhi,133310,3907,118633,255850
3,Rajasthan,38964,650,27569,67183
4,Uttar Pradesh,77334,1530,45807,124671


## Total cases in each state in India

From the graph, Delhi, Tamil Nadu (TN), Maharashtra and Gujarat have more than 1 lakh (100,000) cases to date, while the other states have far fewer cases.

## Rate of increase in confirmed cases based on Date

The above graph is a time-series graph depicting the confirmed cases to date. Most of the states show a steady or exponential increase in the number of confirmed cases.

## Samples Tested

In [27]:
df_testing = pd.read_csv('StatewiseTestingDetails.csv')
df_testing = df_testing.fillna(0)
df_testing['Date'] = pd.to_datetime(df_testing['Date']).dt.strftime('%Y-%m-%d')
df_testing.head()

Unnamed: 0,Date,State,TotalSamples,Negative,Positive
0,2020-04-17,Andaman and Nicobar Islands,1403.0,1210,12.0
1,2020-04-24,Andaman and Nicobar Islands,2679.0,0,27.0
2,2020-04-27,Andaman and Nicobar Islands,2848.0,0,33.0
3,2020-05-01,Andaman and Nicobar Islands,3754.0,0,33.0
4,2020-05-16,Andaman and Nicobar Islands,6677.0,0,33.0


## Total samples collected everyday in each state

The daily sample collection rate is increasing exponentially in most of the states. However, this feature is suitable for prediction only if 87% of the samples are tested and categorised as 'positive' or 'negative'. For some states the sample collection rate is low, so it would be difficult to predict the testing rate using the ML model.
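To make the "collection rate" concrete, here is a minimal sketch of how a per-day rate could be derived from the testing table. It assumes `TotalSamples` is a cumulative count (the displayed rows only ever increase) and uses the first five Andaman and Nicobar Islands rows shown above. The column names `NewSamples`, `DaysSincePrev` and `SamplesPerDay` are hypothetical and not part of the original dataset.

```python
import pandas as pd

# Cumulative testing figures copied from the first rows of the table above
testing = pd.DataFrame({
    "Date": ["2020-04-17", "2020-04-24", "2020-04-27", "2020-05-01", "2020-05-16"],
    "State": ["Andaman and Nicobar Islands"] * 5,
    "TotalSamples": [1403.0, 2679.0, 2848.0, 3754.0, 6677.0],
})
testing["Date"] = pd.to_datetime(testing["Date"])
testing = testing.sort_values(["State", "Date"])

# Samples added since the previous report for the same state
testing["NewSamples"] = testing.groupby("State")["TotalSamples"].diff()
# Days elapsed since the previous report (reports are not daily for every state)
testing["DaysSincePrev"] = testing.groupby("State")["Date"].diff().dt.days
# Average samples per day over each reporting gap
testing["SamplesPerDay"] = testing["NewSamples"] / testing["DaysSincePrev"]

print(testing[["Date", "NewSamples", "DaysSincePrev", "SamplesPerDay"]])
```

Dividing by the gap between reports matters here because the dates are irregular; a plain difference would overstate the rate for states that report only every few days.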

## Positive cases tested everyday in each state

For every state, a considerable percentage of the samples tested turn out to be positive. For example, in the case of TN, 5% of the 6 lakh (600,000) samples were positive. On the other hand, some states show a flat positive count, which means no new positive cases were recorded for a few days.

An inference from the above observations is that testing samples on a daily basis has certainly helped in detecting a considerable number of positive cases.
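As a worked check of the percentage quoted above, 5% of 6 lakh samples is 0.05 × 600,000 = 30,000 positive samples. The sketch below computes the same kind of positivity percentage with pandas, using the 2020-07-30 rows displayed later in this notebook. `PositivityPct` is a hypothetical column name, and rows with zero samples are left as missing rather than divided.

```python
import pandas as pd

# TotalSamples / Positive values copied from the 2020-07-30 rows shown in this notebook
rows = pd.DataFrame({
    "State/UnionTerritory": ["Delhi", "Nagaland", "Maharashtra", "West Bengal"],
    "TotalSamples": [1013694.0, 37156.0, 2075528.0, 0.0],
    "Positive": [134403.0, 1566.0, 419334.0, 0.0],
})

# Share of tested samples that were positive; undefined when nothing was tested
has_tests = rows["TotalSamples"] > 0
rows["PositivityPct"] = (
    (rows["Positive"] / rows["TotalSamples"] * 100).where(has_tests).round(2)
)

print(rows)
```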

## Merging testing details with the main covid-19 dataset

The two datasets are merged to build a model that predicts COVID-19 cases based on the samples tested and the positive/negative cases observed every day.

In [28]:
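merged = []
for i in df_u:
    state = df.loc[df['State/UnionTerritory'] == i]
    state1 = df_testing.loc[df_testing['State'] == i]

    # Join the case record and the testing record for each date of the state
    for j in state['Date']:
        t = state1.loc[state1['Date'] == j]
        t1 = state.loc[state['Date'] == j]
        merged.append(t1.merge(t, how='outer', on=['Date']))

# DataFrame.append was removed in pandas 2.0, so concatenate the pieces once
df4 = pd.concat(merged)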

In [29]:
df4 = df4.drop(columns=['Time','ConfirmedIndianNational','ConfirmedForeignNational','State'])
df4 = df4.fillna(0)
df4 = df4.reset_index()
df4 = df4.drop(columns=['index','Sno'])
df4

Unnamed: 0,Date,State/UnionTerritory,Cured,Deaths,Confirmed,TotalSamples,Negative,Positive
0,2020-01-30,Kerala,0,0,1,0.0,0,0.0
1,2020-01-31,Kerala,0,0,1,0.0,0,0.0
2,2020-02-01,Kerala,0,0,2,0.0,0,0.0
3,2020-02-02,Kerala,0,0,3,0.0,0,0.0
4,2020-02-03,Kerala,0,0,3,0.0,0,0.0
...,...,...,...,...,...,...,...,...
4604,2020-07-28,Dadra and Nagar Haveli and Daman and Diu,564,2,946,41588.0,40258,1020.0
4605,2020-07-29,Dadra and Nagar Haveli and Daman and Diu,596,2,982,41895.0,40411,1058.0
4606,2020-07-30,Dadra and Nagar Haveli and Daman and Diu,648,2,1026,0.0,0,0.0
4607,2020-07-26,Telangana***,40334,455,52466,0.0,0,0.0


## Correlation between the numerical parameters

In [30]:
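# numeric_only=True skips the text columns, which recent pandas versions no longer drop silently
coff = df4.corr(numeric_only=True)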
coff[['Confirmed']]

Unnamed: 0,Confirmed
Cured,0.983507
Deaths,0.940308
Confirmed,1.0
TotalSamples,0.767535
Positive,0.990258


Based on the correlation coefficients, it is best to use the TotalSamples, Positive and Negative features as independent variables. Although the Cured and Deaths features show a very high correlation with Confirmed, they can't be used to predict the confirmed cases, since both of them depend on the confirmed count.

## Model Development

Since the dataset is time-series data, Vector Auto Regression (VAR) is used for forecasting. VAR is chosen because multiple features need to be predicted together: confirmed cases, total samples tested, and positive and negative cases.
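To illustrate what a VAR does with several series at once, here is a simplified sketch of a first-order model, y_t = c + A·y_{t-1}, fitted by ordinary least squares with NumPy. It is not the statsmodels implementation used later in this notebook, the helper names `fit_var1` and `forecast_var1` are hypothetical, and the data is synthetic (generated from a known `A_true` and `c_true` so the fit can be checked).

```python
import numpy as np

def fit_var1(y):
    """Least-squares fit of y_t = c + A @ y_{t-1} for a (T, k) array."""
    y = np.asarray(y, dtype=float)
    # Regressors: an intercept column plus the previous observation of every series
    X = np.hstack([np.ones((len(y) - 1, 1)), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)  # shape (k + 1, k)
    return coef[0], coef[1:].T                        # c: (k,), A: (k, k)

def forecast_var1(y, c, A, steps):
    """Roll the fitted recursion forward from the last observation."""
    last = np.asarray(y, dtype=float)[-1]
    out = []
    for _ in range(steps):
        last = c + A @ last   # each forecast feeds the next step
        out.append(last)
    return np.array(out)

# Synthetic, noise-free two-variable series (assumed values, for illustration only)
A_true = np.array([[0.5, 0.1], [0.0, 0.8]])
c_true = np.array([1.0, 2.0])
y = [np.zeros(2)]
for _ in range(29):
    y.append(c_true + A_true @ y[-1])
y = np.array(y)

c_hat, A_hat = fit_var1(y)
future = forecast_var1(y, c_hat, A_hat, steps=3)
print(A_hat.round(3))
print(future.round(3))
```

The point of the sketch is that every variable's next value depends on the previous values of all variables, which is why Confirmed, TotalSamples and Positive can be forecast jointly.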

In [31]:
from sklearn.utils import shuffle
df5 = shuffle(df4)
df5 = df5.reset_index()
df5= df5.drop(columns=['index'],axis=1)
df5 = df5.sort_values(by='Date')
df5

Unnamed: 0,Date,State/UnionTerritory,Cured,Deaths,Confirmed,TotalSamples,Negative,Positive
1675,2020-01-30,Kerala,0,0,1,0.0,0,0.0
2716,2020-01-31,Kerala,0,0,1,0.0,0,0.0
1556,2020-02-01,Kerala,0,0,2,0.0,0,0.0
2636,2020-02-02,Kerala,0,0,3,0.0,0,0.0
3205,2020-02-03,Kerala,0,0,3,0.0,0,0.0
...,...,...,...,...,...,...,...,...
890,2020-07-30,Delhi,118633,3907,133310,1013694.0,0,134403.0
883,2020-07-30,Nagaland,595,5,1513,37156.0,0,1566.0
1663,2020-07-30,Maharashtra,239755,14463,400651,2075528.0,1656194,419334.0
1530,2020-07-30,West Bengal,44116,1490,65258,0.0,0,0.0


## Prediction using VAR Model

Before training the model, the Johansen test (a cointegration test) is run on the features. All the eigenvalues reported for the respective features are less than 1, which this notebook treats as evidence that the dataset is stationary, so no further differencing/integration is applied.

Further, the VAR model is trained with the data of a single state and the same is used to predict the cases for the upcoming days.

The model below is interactive: the user provides the state and the number of days to predict. Although the model can forecast any number of days, a long forecast is hard to represent in the graph, so an input of at most 10 to 20 days gives a better visualisation.

***Note***: Telengana, Daman & Diu, and Dadra and Nagar Haveli have no testing sample data, so no forecast is produced for these states. For a few other states the number of samples tested is zero, so the Johansen test fails and predictions can't be made.

**The interactive model will run only when the kernel is active**
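One way to avoid the failure described in the note is to check a state's data before fitting. The helper below is a hypothetical sketch, not part of the original notebook: `can_forecast` and its `min_rows` threshold are assumptions, and it presumes that a column which never changes (for example, all zeros) is what breaks the test.

```python
import pandas as pd

FEATURES = ["Confirmed", "TotalSamples", "Positive"]

def can_forecast(state_df, min_rows=10):
    """Return True when there are enough rows and every feature column varies.

    min_rows is an arbitrary illustrative threshold, not a value from the notebook.
    """
    if len(state_df) < min_rows:
        return False
    # A column with a single distinct value carries no trend to fit
    return all(state_df[col].nunique() > 1 for col in FEATURES)

# Illustrative frames: one with usable testing data, one with no samples recorded
usable = pd.DataFrame({
    "Confirmed": range(100, 112),
    "TotalSamples": range(1000, 1120, 10),
    "Positive": range(90, 102),
})
no_tests = usable.assign(TotalSamples=0.0, Positive=0.0)

print(can_forecast(usable))    # True
print(can_forecast(no_tests))  # False
```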

From the forecast model, it is understood that in the forthcoming week most of the states show a steady increase in confirmed cases, while a few states such as Karnataka show a steep decline. Gujarat shows a decrease in the initial days, but its cases increase steadily after 2 days.

## Interactive Map Viz using Folium

The interactive mode works only when the kernel is active. Its purpose is to view the different counts for each of the dates for which the forecast has been made.

## Inference

Most of the states have a steady increase in the cases for this particular week. With better testing rates, the confirmed cases are identified more easily.

In [35]:
df5.to_csv('C:/Users/Smizzy/Documents/data science notes/covid19/trained.csv')

In [37]:
def forecastmodel(State, days):
    from statsmodels.tsa.vector_ar.vecm import coint_johansen
    from statsmodels.tsa.vector_ar.var_model import VAR

    # Load the merged dataset and keep the series for the requested state
    df5 = pd.read_csv('trained.csv')
    df6 = df5.loc[df5['State/UnionTerritory'] == State]
    df6 = df6[['Date', 'Confirmed', 'TotalSamples', 'Positive']]
    df6.index = df6['Date']
    df6 = df6.drop(columns=['Date'])
    # Leave out the most recent date; the forecast starts from that date
    df6_upd = df6.loc[df6.index != df6.index.max()]

    # Johansen test on the selected features (result is not used further below)
    jtest = coint_johansen(df6_upd, 1, 1)

    # Fit the existing data trends to the forecast model
    m = VAR(df6_upd)
    model = m.fit()
    # Forecast from the last k_ar observations of the training data
    valid_pred = model.forecast(df6_upd.values[-model.k_ar:], steps=days)
    # A flat list of names keeps the columns as a plain index, not a MultiIndex
    df7 = pd.DataFrame(valid_pred.round(0), columns=['Confirmed', 'TotalSamples', 'Positive'])
    df7['Date'] = pd.date_range(df6.index.max(), periods=days)
    df7 = df7[['Date', 'Confirmed', 'TotalSamples', 'Positive']]
    return df7