# PRCP-1023-JohnsHopkinsCovid19

## Introduction

This notebook walks step by step through loading, merging, cleaning and aggregating the COVID-19 time series data published by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). JHU CSSE aggregates data from primary sources such as the World Health Organization and national and regional public health institutions.

The outbreak of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is a worldwide threat. Researchers are building mathematical and machine learning models on publicly shared data to analyze the epidemic. This project applies machine learning models to the Johns Hopkins data to understand the day-to-day growth of COVID-19 and to predict its future spread across countries.

### Dataset description

Daily COVID-19 prevalence data from January 22, 2020 to September 21, 2020 were retrieved from the official Johns Hopkins University repository. The repository contains daily case reports and daily time series summary tables. This study uses the time series summary tables in CSV format: three tables, one each for confirmed, death and recovered cases. Each table has the columns Province/State, Country/Region, Lat and Long, followed by one column of cumulative counts per date. The dataset is updated once a day.

In [1]:
import pandas as pd
import numpy as np


In [2]:
#Loading the Datasets

confirmed_df = pd.read_csv('time_series_covid19_confirmed_global.csv')

deaths_df = pd.read_csv('time_series_covid19_deaths_global.csv')

recovered_df = pd.read_csv('time_series_covid19_recovered_global.csv')


## Exploratory Data Analysis

In [3]:
confirmed_df.head() #To check first 5 rows

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,9/12/20,9/13/20,9/14/20,9/15/20,9/16/20,9/17/20,9/18/20,9/19/20,9/20/20,9/21/20
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,38641,38716,38772,38815,38855,38872,38883,38919,39044,39074
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,11185,11353,11520,11672,11816,11948,12073,12226,12385,12535
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,48007,48254,48496,48734,48966,49194,49413,49623,49826,50023
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,1344,1344,1438,1438,1483,1483,1564,1564,1564,1681
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,3335,3388,3439,3569,3675,3789,3848,3901,3991,4117


In [4]:
confirmed_df.columns

Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       ...
       '9/12/20', '9/13/20', '9/14/20', '9/15/20', '9/16/20', '9/17/20',
       '9/18/20', '9/19/20', '9/20/20', '9/21/20'],
      dtype='object', length=248)

In [5]:
deaths_df.columns

Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       ...
       '9/12/20', '9/13/20', '9/14/20', '9/15/20', '9/16/20', '9/17/20',
       '9/18/20', '9/19/20', '9/20/20', '9/21/20'],
      dtype='object', length=248)

In [6]:
recovered_df.columns

Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       ...
       '9/12/20', '9/13/20', '9/14/20', '9/15/20', '9/16/20', '9/17/20',
       '9/18/20', '9/19/20', '9/20/20', '9/21/20'],
      dtype='object', length=248)

In [7]:
#All columns from index 4 (the fifth column) onwards are dates; slice them to get the list of dates


confirmed_df.columns[4:]

Index(['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       '1/28/20', '1/29/20', '1/30/20', '1/31/20',
       ...
       '9/12/20', '9/13/20', '9/14/20', '9/15/20', '9/16/20', '9/17/20',
       '9/18/20', '9/19/20', '9/20/20', '9/21/20'],
      dtype='object', length=244)

## Merging Confirmed, Deaths and Recovered

Before merging, we use melt() to unpivot each DataFrame from its current wide format into long format. In other words, the date columns become values in a single column. The main settings are:

* Use ‘Province/State’, ‘Country/Region’, ‘Lat’ and ‘Long’ as identifier variables. We will later use them for merging.
* Unpivot the date columns (columns[4:], as we saw previously) with variable column ‘Date’ and value column ‘Confirmed’.
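To make the wide-to-long step concrete, here is a minimal sketch on a toy table. The two countries and their counts are made up for illustration; only the column naming follows the JHU files.

```python
import pandas as pd

# Toy wide table: one row per location, one column per date (values are made up).
wide = pd.DataFrame({
    'Country/Region': ['A', 'B'],
    '1/22/20': [0, 1],
    '1/23/20': [2, 3],
})

# melt() keeps the id_vars and stacks every remaining column
# into (Date, Confirmed) pairs, one date column after another.
long = wide.melt(
    id_vars=['Country/Region'],
    var_name='Date',
    value_name='Confirmed',
)
print(long)
```

Two locations and two dates give four rows, which is why the real long tables have one row per location and date.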

In [8]:
#Unpivot each DataFrame from wide to long format: one row per location and date

dates = confirmed_df.columns[4:]
confirmed_df_long = confirmed_df.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
    value_vars=dates, 
    var_name='Date', 
    value_name='Confirmed'
)
deaths_df_long = deaths_df.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
    value_vars=dates, 
    var_name='Date', 
    value_name='Deaths'
)
recovered_df_long = recovered_df.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], 
    value_vars=dates, 
    var_name='Date', 
    value_name='Recovered'
)

In [9]:
#The calls above return new long DataFrames. They are ordered by Date and then Country/Region, because
#melt() stacks the date columns in their original ascending order and the raw rows were already sorted by Country/Region.

#Here is confirmed_df_long as an example

confirmed_df_long 

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed
0,,Afghanistan,33.939110,67.709953,1/22/20,0
1,,Albania,41.153300,20.168300,1/22/20,0
2,,Algeria,28.033900,1.659600,1/22/20,0
3,,Andorra,42.506300,1.521800,1/22/20,0
4,,Angola,-11.202700,17.873900,1/22/20,0
...,...,...,...,...,...,...
64899,,West Bank and Gaza,31.952200,35.233200,9/21/20,36151
64900,,Western Sahara,24.215500,-12.885800,9/21/20,10
64901,,Yemen,15.552727,48.516388,9/21/20,2028
64902,,Zambia,-13.133897,27.849332,9/21/20,14175


In [10]:
#In addition, we have to remove the recovered data for Canada because of a mismatch:
#Canada's recovered data is reported per country rather than per Province/State.

recovered_df_long = recovered_df_long[recovered_df_long['Country/Region']!='Canada']

In [11]:
#After that, we use merge() to merge the 3 DataFrames one after another

# Merging confirmed_df_long and deaths_df_long
full_table = confirmed_df_long.merge(
  right=deaths_df_long, 
  how='left',
  on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)
# Merging full_table and recovered_df_long
full_table = full_table.merge(
  right=recovered_df_long, 
  how='left',
  on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)
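The sketch below illustrates what a left merge does with rows that have no partner, which is what happens to the Canada rows after their recovered data is removed. The frames are made up, and `indicator=True` is an optional check that the notebook itself does not use.

```python
import numpy as np
import pandas as pd

keys = ['Province/State', 'Country/Region', 'Date']

# Made-up rows; NaN in Province/State mimics country-level entries.
left = pd.DataFrame({
    'Province/State': [np.nan, 'X'],
    'Country/Region': ['A', 'B'],
    'Date': ['1/22/20', '1/22/20'],
    'Confirmed': [5, 7],
})
right = pd.DataFrame({
    'Province/State': [np.nan, 'Y'],
    'Country/Region': ['A', 'C'],
    'Date': ['1/22/20', '1/22/20'],
    'Recovered': [2, 9],
})

# how='left' keeps every row of `left`; indicator=True adds a _merge column
# that says whether a matching row was found in `right`.
merged = left.merge(right, how='left', on=keys, indicator=True)
print(merged[['Country/Region', 'Confirmed', 'Recovered', '_merge']])
```

Rows marked `left_only` receive NaN in the columns coming from the right table, which is the source of missing values in a merged column such as Recovered.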

In [12]:
#Now, we should get a full table with Confirmed, Deaths and Recovered columns

full_table

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
0,,Afghanistan,33.939110,67.709953,1/22/20,0,0,0.0
1,,Albania,41.153300,20.168300,1/22/20,0,0,0.0
2,,Algeria,28.033900,1.659600,1/22/20,0,0,0.0
3,,Andorra,42.506300,1.521800,1/22/20,0,0,0.0
4,,Angola,-11.202700,17.873900,1/22/20,0,0,0.0
...,...,...,...,...,...,...,...,...
64899,,West Bank and Gaza,31.952200,35.233200,9/21/20,36151,265,24428.0
64900,,Western Sahara,24.215500,-12.885800,9/21/20,10,1,8.0
64901,,Yemen,15.552727,48.516388,9/21/20,2028,586,1235.0
64902,,Zambia,-13.133897,27.849332,9/21/20,14175,331,13629.0


#  Performing Data Cleaning

## There are 3 tasks we would like to do

* Converting Date from string to datetime
* Replacing missing values (NaN)
* Treating the coronavirus cases reported from 3 cruise ships separately

The values in the new Date column are all strings in m/d/yy format. To convert them from string to datetime, we use pd.to_datetime().

In [13]:
#The values in the new Date column are all strings in m/d/yy format.
#To convert them from string to datetime, let’s use pd.to_datetime()

full_table['Date'] = pd.to_datetime(full_table['Date'])
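As an optional refinement, the date format can be stated explicitly so that pandas does not have to infer it. This is a small sketch on two sample strings, not a change to the cell above.

```python
import pandas as pd

dates = pd.Series(['1/22/20', '9/21/20'])

# %m/%d/%y = month/day/two-digit year, the layout of the JHU date columns.
parsed = pd.to_datetime(dates, format='%m/%d/%y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())
```

An explicit format makes a malformed date raise an error instead of being guessed.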


In [14]:
full_table

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
0,,Afghanistan,33.939110,67.709953,2020-01-22,0,0,0.0
1,,Albania,41.153300,20.168300,2020-01-22,0,0,0.0
2,,Algeria,28.033900,1.659600,2020-01-22,0,0,0.0
3,,Andorra,42.506300,1.521800,2020-01-22,0,0,0.0
4,,Angola,-11.202700,17.873900,2020-01-22,0,0,0.0
...,...,...,...,...,...,...,...,...
64899,,West Bank and Gaza,31.952200,35.233200,2020-09-21,36151,265,24428.0
64900,,Western Sahara,24.215500,-12.885800,2020-09-21,10,1,8.0
64901,,Yemen,15.552727,48.516388,2020-09-21,2028,586,1235.0
64902,,Zambia,-13.133897,27.849332,2020-09-21,14175,331,13629.0


In [15]:
#Missing values NaN can be detected by running full_table.isna().sum()

full_table.isna().sum()

Province/State    45140
Country/Region        0
Lat                   0
Long                  0
Date                  0
Confirmed             0
Deaths                0
Recovered          4636
dtype: int64

In [16]:
full_table.dtypes

Province/State            object
Country/Region            object
Lat                      float64
Long                     float64
Date              datetime64[ns]
Confirmed                  int64
Deaths                     int64
Recovered                float64
dtype: object

#### We found many NaN values in Province/State, which makes sense because many countries only report country-level data.
#### However, there are 4,636 NaN values in Recovered (these include the Canada rows whose recovered data we removed before merging), so let’s replace them with 0.



In [17]:
full_table['Recovered'] = full_table['Recovered'].fillna(0)

In [18]:
full_table.isna().sum()

Province/State    45140
Country/Region        0
Lat                   0
Long                  0
Date                  0
Confirmed             0
Deaths                0
Recovered             0
dtype: int64

In [19]:
full_table.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
0,,Afghanistan,33.93911,67.709953,2020-01-22,0,0,0.0
1,,Albania,41.1533,20.1683,2020-01-22,0,0,0.0
2,,Algeria,28.0339,1.6596,2020-01-22,0,0,0.0
3,,Andorra,42.5063,1.5218,2020-01-22,0,0,0.0
4,,Angola,-11.2027,17.8739,2020-01-22,0,0,0.0


#### Apart from missing values, coronavirus cases were also reported from 3 cruise ships: Grand Princess, Diamond Princess and MS Zaandam. These rows need to be extracted and treated separately because their Province/State and Country/Region assignment is inconsistent over time.

In [20]:
#And here is how we extract the ship data.
#na=False treats a missing Province/State as "no match" instead of propagating NaN into the mask.

ship_rows = (
    full_table['Province/State'].str.contains('Grand Princess', na=False)
    | full_table['Province/State'].str.contains('Diamond Princess', na=False)
    | full_table['Country/Region'].str.contains('Diamond Princess', na=False)
    | full_table['Country/Region'].str.contains('MS Zaandam', na=False)
)
full_ship = full_table[ship_rows]
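Because most Province/State values are NaN, it helps to see how `str.contains` treats missing values. The following is a small sketch on a made-up Series.

```python
import numpy as np
import pandas as pd

province = pd.Series([np.nan, 'Diamond Princess', 'Ontario'])

# By default a missing value yields NaN in the result;
# na=False turns it into a plain False so the mask is strictly boolean.
mask = province.str.contains('Princess', na=False)
print(mask.tolist())
```

A strictly boolean mask can be combined with `|` and negated with `~` without surprises.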

In [21]:
full_ship

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
41,Diamond Princess,Canada,0.0,0.0,2020-01-22,0,0,0.0
42,Grand Princess,Canada,0.0,0.0,2020-01-22,0,0,0.0
102,,Diamond Princess,0.0,0.0,2020-01-22,0,0,0.0
168,,MS Zaandam,0.0,0.0,2020-01-22,0,0,0.0
307,Diamond Princess,Canada,0.0,0.0,2020-01-23,0,0,0.0
...,...,...,...,...,...,...,...,...
64540,,MS Zaandam,0.0,0.0,2020-09-20,9,2,0.0
64679,Diamond Princess,Canada,0.0,0.0,2020-09-21,0,1,0.0
64680,Grand Princess,Canada,0.0,0.0,2020-09-21,13,0,0.0
64740,,Diamond Princess,0.0,0.0,2020-09-21,712,13,651.0


In [22]:
#And to get rid of ship data from full_table :

full_table = full_table[~(ship_rows)]

In [23]:
full_table

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
0,,Afghanistan,33.939110,67.709953,2020-01-22,0,0,0.0
1,,Albania,41.153300,20.168300,2020-01-22,0,0,0.0
2,,Algeria,28.033900,1.659600,2020-01-22,0,0,0.0
3,,Andorra,42.506300,1.521800,2020-01-22,0,0,0.0
4,,Angola,-11.202700,17.873900,2020-01-22,0,0,0.0
...,...,...,...,...,...,...,...,...
64899,,West Bank and Gaza,31.952200,35.233200,2020-09-21,36151,265,24428.0
64900,,Western Sahara,24.215500,-12.885800,2020-09-21,10,1,8.0
64901,,Yemen,15.552727,48.516388,2020-09-21,2028,586,1235.0
64902,,Zambia,-13.133897,27.849332,2020-09-21,14175,331,13629.0


##  Data Aggregation



In [24]:
#So far, Confirmed, Deaths and Recovered all come from the raw CSV dataset.
#Let’s add an Active column, calculated as Active = Confirmed - Deaths - Recovered.
#full_table is a filtered slice at this point, so take an explicit copy first to avoid pandas' SettingWithCopyWarning.

full_table = full_table.copy()
full_table['Active'] = full_table['Confirmed'] - full_table['Deaths'] - full_table['Recovered']


In [25]:
#And here is what full_table looks like now.


full_table

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered,Active
0,,Afghanistan,33.939110,67.709953,2020-01-22,0,0,0.0,0.0
1,,Albania,41.153300,20.168300,2020-01-22,0,0,0.0,0.0
2,,Algeria,28.033900,1.659600,2020-01-22,0,0,0.0,0.0
3,,Andorra,42.506300,1.521800,2020-01-22,0,0,0.0,0.0
4,,Angola,-11.202700,17.873900,2020-01-22,0,0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
64899,,West Bank and Gaza,31.952200,35.233200,2020-09-21,36151,265,24428.0,11458.0
64900,,Western Sahara,24.215500,-12.885800,2020-09-21,10,1,8.0,1.0
64901,,Yemen,15.552727,48.516388,2020-09-21,2028,586,1235.0,207.0
64902,,Zambia,-13.133897,27.849332,2020-09-21,14175,331,13629.0,215.0


In [26]:
#Next, let’s aggregate the data by Date and Country/Region.
#sum() gives the total ‘Confirmed’, ‘Deaths’, ‘Recovered’ and ‘Active’ for each Date and Country/Region.
#reset_index() turns the group keys (Date, Country/Region) back into ordinary columns and restores the default integer index.
#Several columns are selected with a list in double brackets; newer pandas versions reject a bare sequence of names here.

full_grouped = full_table.groupby(['Date', 'Country/Region'])[['Confirmed', 'Deaths', 'Recovered', 'Active']].sum().reset_index()
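The aggregation step can be illustrated on a toy table in which country A reports two provinces and country B reports only a country-level row. All values are made up.

```python
import pandas as pd

rows = pd.DataFrame({
    'Date': ['2020-01-22'] * 3,
    'Country/Region': ['A', 'A', 'B'],
    'Province/State': ['P1', 'P2', None],
    'Confirmed': [1, 2, 5],
    'Deaths': [0, 1, 0],
})

# Sum the province rows into one row per (Date, Country/Region),
# then turn the group keys back into ordinary columns.
per_country = (
    rows.groupby(['Date', 'Country/Region'])[['Confirmed', 'Deaths']]
    .sum()
    .reset_index()
)
print(per_country)
```

Country A ends up with a single row holding the sum of its provinces, while country B is unchanged.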


In [27]:
full_grouped

Unnamed: 0,Date,Country/Region,Confirmed,Deaths,Recovered,Active
0,2020-01-22,Afghanistan,0,0,0.0,0.0
1,2020-01-22,Albania,0,0,0.0,0.0
2,2020-01-22,Algeria,0,0,0.0,0.0
3,2020-01-22,Andorra,0,0,0.0,0.0
4,2020-01-22,Angola,0,0,0.0,0.0
...,...,...,...,...,...,...
45379,2020-09-21,West Bank and Gaza,36151,265,24428.0,11458.0
45380,2020-09-21,Western Sahara,10,1,8.0,1.0
45381,2020-09-21,Yemen,2028,586,1235.0,207.0
45382,2020-09-21,Zambia,14175,331,13629.0,215.0


In [28]:
#Now let’s add the day-wise New cases, New deaths and New recovered by subtracting the previous day's cumulative values.

# new cases
temp = full_grouped.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths', 'Recovered']]
temp = temp.sum().diff().reset_index()
# the first row of each country has no previous day, so its difference is not valid
mask = temp['Country/Region'] != temp['Country/Region'].shift(1)
temp.loc[mask, 'Confirmed'] = np.nan
temp.loc[mask, 'Deaths'] = np.nan
temp.loc[mask, 'Recovered'] = np.nan
# renaming columns
temp.columns = ['Country/Region', 'Date', 'New cases', 'New deaths', 'New recovered']
# merging new values
full_grouped = pd.merge(full_grouped, temp, on=['Country/Region', 'Date'])
# filling na with 0
full_grouped = full_grouped.fillna(0)
# fixing data types
cols = ['New cases', 'New deaths', 'New recovered']
full_grouped[cols] = full_grouped[cols].astype('int')
# negative new cases come from downward corrections of the cumulative count; set them to 0
full_grouped['New cases'] = full_grouped['New cases'].apply(lambda x: 0 if x<0 else x)

#And finally here is full_grouped. Be aware that this final output is country-level data:
#Confirmed, Deaths, Recovered and Active are cumulative data.
#New cases, New deaths and New recovered are day-wise data.
#The DataFrame is ordered by Date and Country/Region.
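The same day-wise differences can also be computed with a per-country `diff()`. The sketch below is an alternative formulation on made-up data, not the notebook's own code; the intermediate name `new_cases` is an assumption.

```python
import pandas as pd

cum = pd.DataFrame({
    'Country/Region': ['A', 'A', 'A', 'B', 'B'],
    'Date': pd.to_datetime(['2020-01-22', '2020-01-23', '2020-01-24',
                            '2020-01-22', '2020-01-23']),
    'Confirmed': [0, 3, 2, 10, 15],  # cumulative counts; A is revised downwards on day 3
})

cum = cum.sort_values(['Country/Region', 'Date'])

# diff() inside each country never mixes two countries;
# the first day of each country has no predecessor and becomes NaN, then 0.
new_cases = cum.groupby('Country/Region')['Confirmed'].diff().fillna(0)

# A downward revision gives a negative difference; clip it at 0.
cum['New cases'] = new_cases.clip(lower=0).astype(int)
print(cum)
```

Grouping before calling `diff()` removes the need for the shift-based mask, because no difference is ever taken across a country boundary.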


In [29]:
full_grouped

Unnamed: 0,Date,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered
0,2020-01-22,Afghanistan,0,0,0.0,0.0,0,0,0
1,2020-01-22,Albania,0,0,0.0,0.0,0,0,0
2,2020-01-22,Algeria,0,0,0.0,0.0,0,0,0
3,2020-01-22,Andorra,0,0,0.0,0.0,0,0,0
4,2020-01-22,Angola,0,0,0.0,0.0,0,0,0
...,...,...,...,...,...,...,...,...,...
45379,2020-09-21,West Bank and Gaza,36151,265,24428.0,11458.0,465,3,728
45380,2020-09-21,Western Sahara,10,1,8.0,1.0,0,0,0
45381,2020-09-21,Yemen,2028,586,1235.0,207.0,2,0,8
45382,2020-09-21,Zambia,14175,331,13629.0,215.0,44,1,264


# Visualization

In [30]:

# Checking with a selected country, i.e. India

!pip install altair
India = full_grouped[full_grouped['Country/Region'] == 'India']



In [31]:
import altair as alt
base = alt.Chart(India).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=500
)





In [32]:
red=alt.value('#f54242')
base.encode(y='Confirmed').properties(title='Total Confirmed')|base.encode(y='Deaths', color=red).properties(title='Total Deaths')

In [33]:
base = alt.Chart(India).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=1000
)


red=alt.value('#f54242')
base.encode(y='New cases').properties(title='Daily new cases')|base.encode(y='New deaths', color=red).properties(title='Daily new deaths')

In [34]:
# Insights from selected countries

countries = ['US', 'India', 'China', 'Spain', 'Germany', 'France', 'Iran', 'United Kingdom', 'Switzerland']
selected_countries = full_grouped[full_grouped['Country/Region'].isin(countries)]

In [35]:
selected_countries

Unnamed: 0,Date,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered
36,2020-01-22,China,548,17,28.0,503.0,0,0,0
61,2020-01-22,France,0,0,0.0,0.0,0,0,0
65,2020-01-22,Germany,0,0,0.0,0.0,0,0,0
78,2020-01-22,India,0,0,0.0,0.0,0,0,0
80,2020-01-22,Iran,0,0,0.0,0.0,0,0,0
...,...,...,...,...,...,...,...,...,...
45278,2020-09-21,Iran,425481,24478,361523.0,39480.0,3341,177,1953
45354,2020-09-21,Spain,671468,30663,150376.0,490429.0,31428,168,0
45359,2020-09-21,Switzerland,50378,2050,40500.0,7828.0,1095,5,0
45370,2020-09-21,US,6856884,199865,2615949.0,4041070.0,52070,356,25278


In [36]:
#Let’s create a circle chart to display the day-wise New cases


alt.Chart(selected_countries).mark_circle().encode(
    x='monthdate(Date):O',
    y='Country/Region',
    color='Country/Region',
    size=alt.Size('New cases:Q',
        scale=alt.Scale(range=[0, 1000]),
        legend=alt.Legend(title='Daily new cases')
    ) 
).properties(
    width=800,
    height=300
)

In [37]:
import plotly.express as px
from plotly.subplots import make_subplots

from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

In [38]:
#Grouping different types of cases as per the date
datewise=full_grouped.groupby(["Date"]).agg({"Confirmed":'sum',"Recovered":'sum',"Deaths":'sum'})
datewise["Days Since"]=datewise.index-datewise.index.min()

In [39]:
print("Basic Information")
print("Total number of countries with Disease Spread: ",len(full_grouped["Country/Region"].unique()))
print("Total number of Confirmed Cases around the World: ",datewise["Confirmed"].iloc[-1])
print("Total number of Recovered Cases around the World: ",datewise["Recovered"].iloc[-1])
print("Total number of Deaths Cases around the World: ",datewise["Deaths"].iloc[-1])
print("Total number of Active Cases around the World: ",(datewise["Confirmed"].iloc[-1]-datewise["Recovered"].iloc[-1]-datewise["Deaths"].iloc[-1]))
print("Total number of Closed Cases around the World: ",datewise["Recovered"].iloc[-1]+datewise["Deaths"].iloc[-1])

Basic Information
Total number of countries with Disease Spread:  186
Total number of Confirmed Cases around the World:  31245063
Total number of Recovered Cases around the World:  21259919.0
Total number of Deaths Cases around the World:  963677
Total number of Active Cases around the World:  9021467.0
Total number of Closed Cases around the World:  22223596.0


In [40]:
fig=px.bar(x=datewise.index,y=datewise["Confirmed"]-datewise["Recovered"]-datewise["Deaths"])
fig.update_layout(title="Distribution of Number of Active Cases",
                  xaxis_title="Date",yaxis_title="Number of Cases",)
fig.show()

In [41]:
#Growth rate of Confirmed, Recovered and Death Cases

import plotly.graph_objects as go

fig=go.Figure()
fig.add_trace(go.Scatter(x=datewise.index, y=datewise["Confirmed"],
                    mode='lines+markers',
                    name='Confirmed Cases'))
fig.add_trace(go.Scatter(x=datewise.index, y=datewise["Recovered"],
                    mode='lines+markers',
                    name='Recovered Cases'))
fig.add_trace(go.Scatter(x=datewise.index, y=datewise["Deaths"],
                    mode='lines+markers',
                    name='Death Cases'))
fig.update_layout(title="Growth of different types of cases",
                 xaxis_title="Date",yaxis_title="Number of Cases",legend=dict(x=0,y=1,traceorder="normal"))
fig.show()

### Mortality and Recovery Rate analysis around the World


In [42]:
#Calculating the Mortality Rate and Recovery Rate
datewise["Mortality Rate"]=(datewise["Deaths"]/datewise["Confirmed"])*100
datewise["Recovery Rate"]=(datewise["Recovered"]/datewise["Confirmed"])*100
datewise["Active Cases"]=datewise["Confirmed"]-datewise["Recovered"]-datewise["Deaths"]
datewise["Closed Cases"]=datewise["Recovered"]+datewise["Deaths"]

print("Average Mortality Rate",datewise["Mortality Rate"].mean())
print("Median Mortality Rate",datewise["Mortality Rate"].median())
print("Average Recovery Rate",datewise["Recovery Rate"].mean())
print("Median Recovery Rate",datewise["Recovery Rate"].median())

#Plotting Mortality and Recovery Rate 
fig = make_subplots(rows=2, cols=1,
                   subplot_titles=("Recovery Rate", "Mortality Rate"))
fig.add_trace(
    go.Scatter(x=datewise.index, y=(datewise["Recovered"]/datewise["Confirmed"])*100,name="Recovery Rate"),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=datewise.index, y=(datewise["Deaths"]/datewise["Confirmed"])*100,name="Mortality Rate"),
    row=2, col=1
)
fig.update_layout(height=1000,legend=dict(x=-0.1,y=1.2,traceorder="normal"))
fig.update_xaxes(title_text="Date", row=1, col=1)
fig.update_yaxes(title_text="Recovery Rate", row=1, col=1)
fig.update_xaxes(title_text="Date", row=2, col=1)
fig.update_yaxes(title_text="Mortality Rate", row=2, col=1)
fig.show()

Average Mortality Rate 4.545539802217962
Median Mortality Rate 4.140046134580243
Average Recovery Rate 41.102317487568726
Median Recovery Rate 43.97207535164027
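The averages above are means over the daily ratios, so they differ from the rate on the last day of the series. As a worked check, the sketch below recomputes the last-day rates from the world totals printed in the "Basic Information" output.

```python
# World totals on 2020-09-21, copied from the "Basic Information" output above.
confirmed = 31_245_063
deaths = 963_677
recovered = 21_259_919

# Rate = cases in the category / confirmed cases * 100
mortality_rate = deaths / confirmed * 100
recovery_rate = recovered / confirmed * 100

print(round(mortality_rate, 2), round(recovery_rate, 2))
```

The last-day mortality rate of about 3.08% is lower than the 4.55% average, and the last-day recovery rate of about 68.04% is higher than the 41.10% average, because both ratios changed over the course of the series.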


# Prediction using Machine Learning Models


In [43]:
import matplotlib.pyplot as plt
import datetime as dt
from datetime import timedelta
import seaborn as sns
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error,r2_score

# Linear Regression

In [44]:
datewise["Days Since"]=datewise.index-datewise.index[0]
datewise["Days Since"]=datewise["Days Since"].dt.days

# Splitting the data

train_ml=datewise.iloc[:int(datewise.shape[0]*0.95)]
valid_ml=datewise.iloc[int(datewise.shape[0]*0.95):]
model_scores=[]

#Fitting

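#The normalize argument is not available in scikit-learn 1.2 and later; it does not affect ordinary least squares predictions
lin_reg=LinearRegression()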
lin_reg.fit(np.array(train_ml["Days Since"]).reshape(-1,1),np.array(train_ml["Confirmed"]).reshape(-1,1))
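The split above is chronological: the first 95% of the days are used for training and the most recent 5% for validation, with no shuffling. A minimal sketch of the same rule on row positions follows; the helper name `chronological_split` is hypothetical, and 244 is the number of date columns seen earlier.

```python
import numpy as np

def chronological_split(n_rows, train_fraction=0.95):
    """Return row positions for a time-ordered train/validation split."""
    cut = int(n_rows * train_fraction)  # same truncation as int(shape[0]*0.95)
    positions = np.arange(n_rows)
    return positions[:cut], positions[cut:]

train_idx, valid_idx = chronological_split(244)
print(len(train_idx), len(valid_idx))
```

With 244 days this gives 231 training days and 13 validation days, so every validation day lies after every training day.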


In [45]:
#Validation: predict on the held-out days and compute the RMSE

prediction_valid_linreg=lin_reg.predict(np.array(valid_ml["Days Since"]).reshape(-1,1))

model_scores.append(np.sqrt(mean_squared_error(valid_ml["Confirmed"],prediction_valid_linreg)))
print("Root Mean Square Error for Linear Regression: ",np.sqrt(mean_squared_error(valid_ml["Confirmed"],prediction_valid_linreg)))

Root Mean Square Error for Linear Regression:  7796533.683546501
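For reference, the RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below spells this out with NumPy on made-up numbers; the helper name `rmse` is an assumption and not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error of two equally long sequences."""
    # ravel() guards against subtracting an (n, 1) array from an (n,) array,
    # which NumPy would silently broadcast to an (n, n) matrix.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are -10, 10 and 0, so the RMSE is sqrt(200 / 3).
print(rmse([100, 200, 300], [110, 190, 300]))
```

The RMSE is expressed in the same unit as the target, so the value above is an error measured in confirmed cases.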


In [46]:
#Predict over the whole date range and flatten the result for plotting
prediction_linreg=lin_reg.predict(np.array(datewise["Days Since"]).reshape(-1,1))
linreg_output=prediction_linreg.ravel()

fig=go.Figure()
fig.add_trace(go.Scatter(x=datewise.index, y=datewise["Confirmed"],
                    mode='lines+markers',name="Train Data for Confirmed Cases"))
fig.add_trace(go.Scatter(x=datewise.index, y=linreg_output,
                    mode='lines',name="Linear Regression Best Fit Line",
                    line=dict(color='black', dash='dot')))
fig.update_layout(title="Confirmed Cases Linear Regression Prediction",
                 xaxis_title="Date",yaxis_title="Confirmed Cases",legend=dict(x=0,y=1,traceorder="normal"))
fig.show()

#### Prediction for next 30 days

In [47]:
new_date=[]
new_prediction_lr=[]

#range(1,31) covers the next 30 days
for i in range(1,31):
    new_date.append(datewise.index[-1]+timedelta(days=i))
    new_prediction_lr.append(lin_reg.predict(np.array(datewise["Days Since"].max()+i).reshape(-1,1)).ravel()[0])

#Build the result table once, after the loop
pd.set_option("display.float_format",lambda x: '%.f'%x)
model_predictions=pd.DataFrame(zip(new_date,new_prediction_lr),columns=["Dates","LR"])

In [48]:
model_predictions.head(30)

Unnamed: 0,Dates,LR
0,2020-09-22,22628709
1,2020-09-23,22744331
2,2020-09-24,22859953
3,2020-09-25,22975576
4,2020-09-26,23091198
5,2020-09-27,23206820
6,2020-09-28,23322442
7,2020-09-29,23438065
8,2020-09-30,23553687
9,2020-10-01,23669309
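The forecast loop can also be written without a loop by predicting all future day indices at once. The sketch below uses a synthetic, perfectly linear series as a stand-in for `datewise`; the names `horizon`, `future_days` and `future_dates` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for datewise: 10 days of a perfectly linear series.
index = pd.date_range('2020-01-22', periods=10, freq='D')
days = np.arange(10).reshape(-1, 1)
confirmed = 100 + 20 * days.ravel()

model = LinearRegression().fit(days, confirmed)

horizon = 30
# Day indices and calendar dates for the 30 days after the last observation.
future_days = np.arange(days.max() + 1, days.max() + 1 + horizon).reshape(-1, 1)
future_dates = pd.date_range(index[-1] + pd.Timedelta(days=1), periods=horizon, freq='D')

forecast = pd.DataFrame({'Dates': future_dates, 'LR': model.predict(future_days)})
print(forecast.head())
```

Deriving both the day indices and the dates from the same `horizon` keeps the two columns aligned and the forecast exactly 30 rows long.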


# Polynomial Regression 

In [49]:
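from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline


train_ml=datewise.iloc[:int(datewise.shape[0]*0.95)]
valid_ml=datewise.iloc[int(datewise.shape[0]*0.95):]

poly = PolynomialFeatures(degree = 8) 
train_poly=poly.fit_transform(np.array(train_ml["Days Since"]).reshape(-1,1))
#Reuse the transformer fitted on the training data for the validation data
valid_poly=poly.transform(np.array(valid_ml["Days Since"]).reshape(-1,1))
y=train_ml["Confirmed"]

#normalize=True is not available in scikit-learn 1.2 and later; standardize the polynomial features in a pipeline instead
linreg=make_pipeline(StandardScaler(),LinearRegression())
linreg.fit(train_poly,y)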
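Polynomial regression is ordinary linear regression on powers of the input. The sketch below shows this on a synthetic quadratic series, using a pipeline so that the feature expansion is fitted on the training part only. The data and the degree of 2 are assumptions chosen to keep the example well conditioned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic series: y = 3 + 2x + 0.5x^2, with no noise.
x = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 + 2.0 * x.ravel() + 0.5 * x.ravel() ** 2

# Chronological split: first 15 points for training, last 5 for validation.
x_train, x_valid = x[:15], x[15:]
y_train, y_valid = y[:15], y[15:]

# The pipeline expands x into [1, x, x^2] and fits a linear model on those columns.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x_train, y_train)

pred = model.predict(x_valid)
print(np.allclose(pred, y_valid))
```

On real case counts a high degree can fit the training range closely and still extrapolate poorly, so forecasts beyond the observed dates should be read with caution.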

In [50]:
prediction_poly=linreg.predict(valid_poly)
rmse_poly=np.sqrt(mean_squared_error(valid_ml["Confirmed"],prediction_poly))
model_scores.append(rmse_poly)
print("Root Mean Squared Error for Polynomial Regression: ",rmse_poly)

Root Mean Squared Error for Polynomial Regression:  884875.1371881566


In [51]:
comp_data=poly.transform(np.array(datewise["Days Since"]).reshape(-1,1))
predictions_poly=linreg.predict(comp_data)

fig=go.Figure()
fig.add_trace(go.Scatter(x=datewise.index, y=datewise["Confirmed"],
                    mode='lines+markers',name="Train Data for Confirmed Cases"))
fig.add_trace(go.Scatter(x=datewise.index, y=predictions_poly,
                    mode='lines',name="Polynomial Regression Best Fit",
                    line=dict(color='black', dash='dot')))
fig.update_layout(title="Confirmed Cases Polynomial Regression Prediction",
                 xaxis_title="Date",yaxis_title="Confirmed Cases",
                 legend=dict(x=0,y=1,traceorder="normal"))
fig.show()

#### Prediction for next 30 days

In [52]:
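new_date=[]
new_prediction_poly=[]
for i in range(1,31):
    new_date.append(datewise.index[-1]+timedelta(days=i))
    #Reuse the fitted transformer instead of refitting it on a single value
    new_date_poly=poly.transform(np.array(datewise["Days Since"].max()+i).reshape(-1,1))
    new_prediction_poly.append(linreg.predict(new_date_poly)[0])

#Build the result table once, after the loop
pd.set_option("display.float_format",lambda x: '%.f'%x)
model_predictions=pd.DataFrame(zip(new_date,new_prediction_poly),columns=["Dates","POLY"])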


In [53]:
model_predictions.head(30)

Unnamed: 0,Dates,POLY
0,2020-09-22,33637529
1,2020-09-23,34255078
2,2020-09-24,34911134
3,2020-09-25,35608680
4,2020-09-26,36350864
5,2020-09-27,37141000
6,2020-09-28,37982580
7,2020-09-29,38879281
8,2020-09-30,39834964
9,2020-10-01,40853689


# Support Vector Machine

In [54]:
datewise["Days Since"]=datewise.index-datewise.index[0]
datewise["Days Since"]=datewise["Days Since"].dt.days

train_ml=datewise.iloc[:int(datewise.shape[0]*0.95)]
valid_ml=datewise.iloc[int(datewise.shape[0]*0.95):]
#model_scores already holds the linear and polynomial RMSE values, so it is not reset here

svm=SVR(C=1,degree=5,kernel='poly',epsilon=0.001)

#SVR expects a 1-D target array
svm.fit(np.array(train_ml["Days Since"]).reshape(-1,1),np.array(train_ml["Confirmed"]))
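SVR is sensitive to the scale of both the input and the target, and here the target is in the millions while C is 1. The sketch below compares an unscaled and a scaled SVR on a synthetic linear series. It is an illustration under assumed data and a linear kernel, not a claim about how the notebook's model would behave.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic cumulative series in the millions (assumed, noise-free).
x = np.arange(50, dtype=float).reshape(-1, 1)
y = 1_000_000 + 50_000 * x.ravel()
x_train, x_valid, y_train, y_valid = x[:45], x[45:], y[:45], y[45:]

# Same hyperparameters in both models; only the scaling differs.
raw_svr = SVR(kernel='linear', C=1, epsilon=0.01).fit(x_train, y_train)

scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='linear', C=1, epsilon=0.01)),
    transformer=StandardScaler(),  # standardizes y for fitting, inverts it on predict
).fit(x_train, y_train)

rmse_raw = np.sqrt(mean_squared_error(y_valid, raw_svr.predict(x_valid)))
rmse_scaled = np.sqrt(mean_squared_error(y_valid, scaled_svr.predict(x_valid)))
print(rmse_scaled < rmse_raw)
```

With a small C, the unscaled model cannot reach the slope the data require, which is one plausible contributor to a large validation error on raw case counts.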


In [55]:
prediction_valid_svm=svm.predict(np.array(valid_ml["Days Since"]).reshape(-1,1))

model_scores.append(np.sqrt(mean_squared_error(valid_ml["Confirmed"],prediction_valid_svm)))
print("Root Mean Square Error for Support Vector Machine: ",np.sqrt(mean_squared_error(valid_ml["Confirmed"],prediction_valid_svm)))


Root Mean Square Error for Support Vector Machine:  14789024.714558557


In [56]:
#Plot the support vector machine fit over the whole date range
prediction_svm=svm.predict(np.array(datewise["Days Since"]).reshape(-1,1))

fig=go.Figure()
fig.add_trace(go.Scatter(x=datewise.index, y=datewise["Confirmed"],
                    mode='lines+markers',name="Train Data for Confirmed Cases"))
fig.add_trace(go.Scatter(x=datewise.index, y=prediction_svm,
                    mode='lines',name="Support Vector Machine Best Fit",
                    line=dict(color='black', dash='dot')))
fig.update_layout(title="Confirmed Cases Support Vector Machine Prediction",
                 xaxis_title="Date",yaxis_title="Confirmed Cases",legend=dict(x=0,y=1,traceorder="normal"))
fig.show()

#### Prediction for next 30 days

In [57]:

new_date=[]
new_prediction_svm=[]

#range(1,31) covers the next 30 days
for i in range(1,31):
    new_date.append(datewise.index[-1]+timedelta(days=i))
    new_prediction_svm.append(svm.predict(np.array(datewise["Days Since"].max()+i).reshape(-1,1))[0])

#Build the result table once, after the loop
pd.set_option("display.float_format",lambda x: '%.f'%x)
model_predictions=pd.DataFrame(zip(new_date,new_prediction_svm),columns=["Dates","SVM"])


In [58]:
model_predictions.head(30)

Unnamed: 0,Dates,SVM
0,2020-09-22,16409224
1,2020-09-23,16658208
2,2020-09-24,16911290
3,2020-09-25,17168520
4,2020-09-26,17429950
5,2020-09-27,17695631
6,2020-09-28,17965615
7,2020-09-29,18239953
8,2020-09-30,18518697
9,2020-10-01,18801902


# Conclusion

Machine learning algorithms play an important role in epidemic analysis and forecasting. Given large volumes of epidemic data, machine learning techniques help to find patterns so that early action can be planned to stop the spread of the virus.
In this project, machine learning models are used to observe the day-to-day behaviour of COVID-19 and to predict its future spread across countries using the data from the Johns Hopkins repository.

We used 3 machine learning algorithms: Linear Regression, Polynomial Regression and Support Vector Machine. Polynomial regression (PR) yielded the lowest root mean square error (RMSE) on the validation data, about 0.88 million cases, compared with about 7.8 million for linear regression and about 14.8 million for the support vector machine. However, if the spread follows the trend predicted by the PR model, it would lead to a huge loss of lives, because the model projects rapidly accelerating growth of transmission worldwide.

The world is in the grip of COVID-19, the disease caused by SARS-CoV-2. Early prediction of transmission can help in taking the necessary actions. As observed in China, the growth of COVID-19 can be slowed by reducing contact between susceptible and infected individuals. This is achievable through social distancing and by following lockdown measures with discipline.
