# Estimate Current Number of Covid19 Cases and Predict Future Number of Covid19 Cases/Deaths Based on Current Death Rate Data

Covid19 was first reported in China during December 2019.  On January 23, 2020, China issued lockdown orders for Wuhan and three other cities as the virus spread.  The lockdown was extended on January 24, 2020.  On January 30, 2020, the World Health Organization (WHO) declared Covid19 a public health emergency of international concern (PHEIC). On February 25, 2020, Italy put the Lombardy region on lockdown due to Covid19.  On March 9, 2020, Italy expanded the lockdown to include the entire country.  On March 11, 2020, WHO declared Covid19 a global pandemic. Since the first reported case in December 2019, approximately three months ago, Covid19 has spread to 167 countries and regions. The United States has started to see widespread cases of Covid19, especially in New York, California and Oregon. Other countries have also been experiencing large numbers of Covid19 infections, specifically Spain, Germany, Iran and France. Many countries have introduced measures to control the virus such as closing schools, promoting social distancing, handwashing, quarantines and lockdowns.  It is likely that these measures have slowed the spread of the virus, although there are concerns that they are not enough and that hospitals worldwide are going to be overwhelmed with large numbers of Covid19 patients. Most officials seem to agree that the virus is going to keep spreading and that complete containment is not possible at this point. The goal now is to slow down the spread of the virus so that hospitals aren't overwhelmed.

This model/analysis has two goals: 1) Estimate the number of cases in each country/region since the virus first started spreading and 2) Predict the numbers of future cases as well as the number of deaths for each country/region. 

The first goal is to estimate the number of cases in each country/region since the virus was first detected in each area.  This estimate is important because the current number of positive cases is unknown in many areas due to a lack of widespread testing. Some countries (such as South Korea) instituted widespread drive-through testing centers, and these countries likely have a better estimate of the spread of the virus. However, other countries have not instituted widespread testing, so the true extent of the spread is unknown.  The estimated number of cases is calculated from the reported deaths. The number of deaths is likely more accurate than the number of confirmed cases and can give a rough estimate of the actual number of cases.  The estimate first assumes a mortality rate and a time from exposure to death.  For example, if the mortality rate is 1% and it takes 25 days for someone to die after they are first exposed, then each reported death implies that roughly 100 people were infected 25 days ago. The current number of positive cases can then be estimated from exponential growth. If 100 people were positive 25 days ago for each reported death, and the doubling time is 5 days, then the cases have doubled five times ($100 \times 2^5$), so roughly 3200 people are currently infected for each reported death. 

| Days | Cases|
|------|------|
|   0  | 100  |
|   5  | 200  |
|   10 | 400  |
|   15 | 800  |
|   20 | 1600 |
|   25 | 3200 |
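
The arithmetic behind the table can be sketched in a few lines. This is only an illustration of the worked example; the function name `estimated_cases_per_death` is hypothetical and is not part of the notebook's model functions.

```python
# Minimal sketch of the worked example: one reported death is scaled up
# by the assumed mortality rate, then grown forward by repeated doubling.
# The function name is hypothetical (not part of the notebook).

def estimated_cases_per_death(mortality_rate, days_to_death, doubling_time):
    # Number of people infected at the time the deceased was exposed
    past_cases = 1.0 / mortality_rate
    # Number of doublings between exposure and death
    doublings = days_to_death / doubling_time
    current_cases = past_cases * 2 ** doublings
    return past_cases, current_cases

past, current = estimated_cases_per_death(mortality_rate=0.01,
                                          days_to_death=25,
                                          doubling_time=5)
print(past, current)
```

With a 1% mortality rate, 25 days to death and a 5 day doubling time, this reproduces the first and last rows of the table.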

The second goal is to predict the future cases/deaths based on the number of cases estimated during the first goal. The model used to describe the data calculates the change in concentration (cases, deaths etc) with time using the following equation: 

$\frac{dC_A}{dt} = k C_A^{\alpha}$

There have been many papers and early estimates of the model's parameters published either in journals or simply online, such as the mortality rate, doubling rate, $R_0$, time from infection to becoming symptomatic, and time from displaying first symptoms to death. These parameters vary by country and, in some cases, with time. For instance, the mortality rate in Italy appears to have increased recently. It's unclear whether this is because the data is early, fewer positive cases are being reported, or the mortality rate has actually increased. Some countries have had extremely low mortality rates while others have had higher mortality rates. Other parameters such as the growth factor have been difficult to estimate for many countries because data is still sparse.  It is likely that in a few weeks the growth factor will become clearer in many countries/regions. The growth factor is important because, if it is calculated as a function of time, it is possible to tell when the virus is on the decline. If the growth factor is greater than one the virus is still increasing, if it is equal to one it has stabilized, and if it is less than one it is declining (upper portion of the logistic curve).  All current data suggests that most countries currently have a growth factor greater than one (exceptions are China and South Korea), but with sparse data it is difficult to calculate a reliable growth factor. The parameters used in the model are estimated from available data and/or taken from the available literature, which is becoming more prevalent every day. 
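
As a rough illustration of the growth factor, the sketch below assumes the common definition (new cases today divided by new cases yesterday); the text above does not spell out a formula, so this definition, the helper names and the toy numbers are all assumptions.

```python
# Sketch of a growth factor calculation, ASSUMING the definition
# "new cases today / new cases yesterday". The cumulative counts are
# made-up toy numbers, not real data.

def daily_new(cumulative):
    # Difference between successive cumulative counts
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

def growth_factors(cumulative):
    new_cases = daily_new(cumulative)
    factors = []
    for yesterday, today in zip(new_cases, new_cases[1:]):
        # Undefined when there were no new cases the day before
        factors.append(today / yesterday if yesterday > 0 else float('nan'))
    return factors

cumulative_cases = [100, 150, 225, 300, 350, 375]
factors = growth_factors(cumulative_cases)
print(factors)
```

In this toy series the factor starts above one, passes through one and then drops below one, which corresponds to the increasing, stabilized and declining phases described above.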

Keep in mind that many assumptions go into this model and that the "estimated" and "predicted" results are just one scenario. This virus is spreading quickly and as a result data, parameters, and models are evolving daily. 

Thank you to Johns Hopkins University for making the data set public.  

## Import Modules for model

In [None]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

import numpy as np
import pandas as pd
import datetime as dt
from datetime import date

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import folium #to create maps


from sklearn.linear_model import LinearRegression
from sklearn import metrics

#for nonlinear regression
from scipy.optimize import curve_fit



Load file with function definitions

In [None]:
#%%writefile covid19_ascending_model_functions.py

# %load covid19_ascending_model_functions.py
"""
Created on Thu Mar 26 14:18:40 2020

@author: Raili
"""

#Date is transformed into a column. This makes plotting the time series data easier. 

def add_date_column(df_raw):
    df_date_column = df_raw.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], value_name='Cases', var_name='Date')
    df_date_column = df_date_column.set_index(['Country/Region', 'Province/State', 'Date'])
    return df_date_column

#Function that takes raw data for confirmed cases and deaths. Returns a dataframe with
#the current accumulated totals for confirmed cases and deaths. Basically, it removes the
#dates and returns the totals for each row at the most recently reported date (last column).

#Example function call:
#accum_data_df = accumulated_data(raw_data_confirmed, raw_data_deaths)
#raw_data_confirmed (and raw_data_deaths) have the following header:
#Province/State, Country/Region, Lat, Long, 1/22/20,...3/18/20

#Dataframe that is returned has the following header:
#Province/State, Country/Region, Lat, Long, Total Confirmed, Total Deaths
def accumulated_data(confirmed, deaths):
        
    #copy so that the new columns are not assigned to a slice of the raw data
    accum_data_df = confirmed[['Province/State', 'Country/Region', 'Lat', 'Long']].copy()
    
    accum_data_df['Total Confirmed'] = confirmed.iloc[ :, -1]
    accum_data_df['Total Deaths'] = deaths.iloc[ :, -1]
   # accum_data_df['Total Recovered'] = recovered.iloc[ :, -1]

    return accum_data_df


### Get DailyData from Cumulative sum
def daily_data(country_time_df,old_name,new_name):
    accum_country_daily=country_time_df.groupby(level=0).diff().fillna(0)
    accum_country_daily=accum_country_daily.rename(columns={old_name:new_name})
    return accum_country_daily

#Function that takes data with a date column and returns a dataframe indexed by place and date
#with the summed values in a renamed column.
#type_data is either 'Country/Region' or 'Province/State'. 
#Example function call: 
#timeseries_data(confirmed_date_column,'Cases','Total Confirmed Cases', 'Country/Region')
#confirmed_date_column has the following header:
#Country/Region, Province/State, Date, Lat, Long, Cases

#Dataframe that is returned has the following header:
#type_data, Date, new_name

def timeseries_data(_time_df,old_name,new_name, type_data):
        
    timeseries_df=_time_df.groupby([type_data,'Date'])['Cases'].sum().reset_index()
    timeseries_df=timeseries_df.set_index([type_data,'Date'])
    timeseries_df.index=timeseries_df.index.set_levels([timeseries_df.index.levels[0], pd.to_datetime(timeseries_df.index.levels[1])])
    timeseries_df=timeseries_df.sort_values([type_data,'Date'],ascending=True)
    timeseries_df=timeseries_df.rename(columns={old_name:new_name})
    return timeseries_df

#Function that takes data with a date column and returns a dataframe indexed by state and date
#with the summed values in a renamed column. 
#Example function call: 
#timeseries_state_data(confirmed_date_column, 'Cases','Total Confirmed Cases')
#confirmed_date_column has the following header:
#Country/Region, Province/State, Date, Lat, Long, Cases

#Dataframe that is returned has the following header:
#Province/State, Date, new_name

def timeseries_state_data(state_time_df,old_name,new_name):
    state_time_df=state_time_df.groupby(['Province/State','Date'])['Cases'].sum().reset_index()
    state_time_df=state_time_df.set_index(['Province/State','Date'])
    state_time_df.index=state_time_df.index.set_levels([state_time_df.index.levels[0], pd.to_datetime(state_time_df.index.levels[1])])
    state_time_df=state_time_df.sort_values(['Province/State','Date'],ascending=True)
    state_time_df=state_time_df.rename(columns={old_name:new_name})
    
    return state_time_df

#function to clean the data
def clean_data(df_to_clean):
    
    # replace mainland china with China
    df_to_clean['Country/Region'] = df_to_clean['Country/Region'].replace('Mainland China', 'China')
    
    return df_to_clean
    
    
#-----------------------------------------------------------------------------------------------------------------
#function calculates likely positive cases based on current death rate, mortality rate, and doubling time
#consolidated_df: dataframe that includes columns named 'Daily New Deaths'

def estimate_current_cases(consolidated_df, days_to_symptoms, days_after_symptoms, mortality_rate, doubling_time, place):
    
    death_time = days_to_symptoms + days_after_symptoms

    #calculate estimated people who have virus at the time of the death. For instance if death_time is 
    #20 days this calculation shows the possible number of positive cases 20 days previously.
    #Since these are calculated for the number of days before death shift data in dataframe back by death_time
    #days.
    estimated_past_cases = consolidated_df.copy()
    estimated_past_cases = estimated_past_cases.loc[:,['Daily New Deaths']]
    estimated_past_cases = np.round(estimated_past_cases['Daily New Deaths']/mortality_rate,3)

    #calculate possible number of cases at current date with the doubling rate. 
    #estimated_current_cases = consolidated_df.copy()
    #estimated_current_cases = estimated_current_cases.loc[:,['Daily New Deaths']]
    estimated_current_cases = np.round(estimated_past_cases*2**(death_time/doubling_time))

    #Estimated past cases were calculated for the number of days death_time previously. They need to be shifted back
    #by this number of days. 
    estimated_past_cases = estimated_past_cases.groupby([place]).shift(-death_time)

    #Combine estimated past cases with estimated current cases
    data_to_combine = [estimated_past_cases, estimated_current_cases]
    headers = ['Estimated Past Cases', 'Estimated Current Cases']
    estimated_cases = pd.concat(data_to_combine, axis=1, keys=headers)
    
    return estimated_cases


## Import data

In [None]:
url_confirmed = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
url_deaths = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
#url_recovered = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv'

raw_data_confirmed = pd.read_csv(url_confirmed)
raw_data_deaths = pd.read_csv(url_deaths)
#raw_data_recovered = pd.read_csv(url_recovered)


# Visualize

### Summary Table

In [None]:
#Combine data into a table with accumulated data. This is basically the last column of the 
#raw data set for confirmed cases and deaths. 
accum_data_df = accumulated_data(raw_data_confirmed, raw_data_deaths)
accum_data_df.head()

In [None]:
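#select the two columns with a list (double brackets); tuple-style selection is not supported by recent pandas
country_table = accum_data_df.groupby('Country/Region')[['Total Confirmed','Total Deaths']].sum()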
country_table_sorted = country_table.sort_values(by='Total Confirmed', ascending=False)
country_table_sorted.style.background_gradient()

### Map Showing Confirmed Cases

In [None]:
#Remove cases from dataset that are zero for 'Total Confirmed' for map
map_data = accum_data_df[accum_data_df['Total Confirmed']>0]

#map of Total Confirmed Cases
covid_map = folium.Map(location=[0, 0], tiles='cartodbpositron',
               min_zoom=1, max_zoom=6, zoom_start=2)

for i in range(0, len(map_data)):
    folium.CircleMarker(
        location=[map_data.iloc[i]['Lat'], map_data.iloc[i]['Long']],
        radius=int(map_data.iloc[i]['Total Confirmed'])/5000,
        color='blue',
        tooltip =   '<li><bold>Country : '+str(map_data.iloc[i]['Country/Region'])+
                    '<li><bold>Province : '+str(map_data.iloc[i]['Province/State'])+
                    '<li><bold>Confirmed : '+str(map_data.iloc[i]['Total Confirmed'])+
                    '<li><bold>Deaths : '+str(map_data.iloc[i]['Total Deaths'])
         ).add_to(covid_map)
    
covid_map


### Map Showing Confirmed Deaths

In [None]:
#Remove cases from dataset that are zero for 'Total Deaths' for map
map_data_deaths = map_data[map_data['Total Deaths']>0]

#map of Total Confirmed Deaths
covid_map_deaths = folium.Map(location=[0, 0], tiles='cartodbpositron',
               min_zoom=1, max_zoom=6, zoom_start=2)

for i in range(0, len(map_data_deaths)):
    folium.CircleMarker(
        location=[map_data_deaths.iloc[i]['Lat'], map_data_deaths.iloc[i]['Long']],
        radius=int(map_data_deaths.iloc[i]['Total Deaths'])/1000,
        color='black',
        tooltip =   '<li><bold>Deaths : '+str(map_data_deaths.iloc[i]['Total Deaths'])).add_to(covid_map_deaths)      
        
covid_map_deaths



## Plot Timeseries Data and Estimate Number of Likely Cases from Death Data

### Plot Data for Countries

In [None]:
#Convert dates from raw data from column headers to a column for easier plotting. 
#Do this for the confirmed, deaths and recovered data sets. 
confirmed_date_column=add_date_column(raw_data_confirmed)
deaths_date_column=add_date_column(raw_data_deaths)
#recovered_date_column=add_date_column(raw_data_recovered)

confirmed_date_column.head()
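
The wide-to-long reshaping done by `add_date_column` can be seen on a tiny example. The sketch below uses made-up counts and place names in the same column layout as the raw data, and additionally parses the date strings, which the notebook does later inside `timeseries_data`.

```python
import pandas as pd

# Toy frame in the same wide layout as the raw data: one column per date.
# The counts and coordinates are made-up placeholders, not real data.
wide = pd.DataFrame({
    'Province/State': [None, 'Region X'],
    'Country/Region': ['Country A', 'Country B'],
    'Lat': [10.0, 20.0],
    'Long': [30.0, 40.0],
    '1/22/20': [1, 10],
    '1/23/20': [2, 20],
})

# melt turns every date column into a (Date, Cases) pair of rows
long_df = wide.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_name='Cases', var_name='Date')

# Parse the m/d/yy strings used in the column headers
long_df['Date'] = pd.to_datetime(long_df['Date'], format='%m/%d/%y')
print(long_df)
```

Two places and two dates give four rows, one per place and date combination.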

In [None]:
#Consolidate dates associated with each country for confirmed, deaths and recovered
confirmed_country_df=timeseries_data(confirmed_date_column,'Cases','Total Confirmed Cases', 'Country/Region')
deaths_country_df=timeseries_data(deaths_date_column,'Cases','Total Deaths','Country/Region')
#recoveries_country_df=timeseries_data(recovered_date_column,'Cases','Total Recoveries','Country/Region')

confirmed_country_df.head()

In [None]:
#Calculate daily data for confirmed, deaths and recoveries. This is the difference of data between
#each successive day. 
new_daily_cases_country=daily_data(confirmed_country_df,'Total Confirmed Cases','Daily New Cases')
new_daily_deaths_country=daily_data(deaths_country_df,'Total Deaths','Daily New Deaths')
#new_daily_recoveries_country=daily_data(recoveries_country_df,'Total Recoveries','Daily New Recoveries')

new_daily_cases_country.head()
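
To see what `daily_data` does, the sketch below applies the same `groupby(level=0).diff().fillna(0)` pattern to a toy cumulative series for two made-up places.

```python
import pandas as pd

# Toy cumulative death counts for two hypothetical places over three days
index = pd.MultiIndex.from_product(
    [['A', 'B'], pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03'])],
    names=['Country/Region', 'Date'])
cumulative = pd.DataFrame({'Total Deaths': [1, 3, 6, 10, 10, 15]}, index=index)

# Difference within each place so that the first row of B is not
# compared against the last row of A
daily = (cumulative.groupby(level=0).diff().fillna(0)
         .rename(columns={'Total Deaths': 'Daily New Deaths'}))
print(daily)
```

Note that the first day of each place becomes 0, so any deaths already present on the first date are not counted as new.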

In [None]:
#Combine the date, country and daily values for confirmed, deaths and recovered into one dataframe. 
country_consolidated_df=pd.merge(confirmed_country_df,deaths_country_df,how='left',left_index=True,right_index=True)
#country_consolidated_df=pd.merge(country_consolidated_df,recoveries_country_df,how='left',left_index=True,right_index=True)
country_consolidated_df=pd.merge(country_consolidated_df,new_daily_cases_country,how='left',left_index=True,right_index=True)
country_consolidated_df=pd.merge(country_consolidated_df,new_daily_deaths_country,how='left',left_index=True,right_index=True)
#country_consolidated_df=pd.merge(country_consolidated_df,new_daily_recoveries_country,how='left',left_index=True,right_index=True)

#Calculate current active cases and add as new column in dataframe
#country_consolidated_df['Active Cases']=country_consolidated_df['Total Confirmed Cases']-country_consolidated_df['Total Deaths']

#Calculate percent recoveries
#country_consolidated_df['Share of Recoveries - Closed Cases']=np.round(country_consolidated_df['Total Recoveries']/(country_consolidated_df['Total Recoveries']+country_consolidated_df['Total Deaths']),2)

#Calculate mortality rate by using total deaths divided total confirmed cases. Note that this mortality rate
#is likely higher than the actual mortality rate because there are cases that aren't counted if the person wasn't
#tested. Also, this is only one way to calculate the mortality rate. There are also people among the total confirmed
#cases who have not yet died or recovered. This number should be considered a best guessed estimate based on 
#available data. 
country_consolidated_df['Death to Cases Ratio']=np.round(country_consolidated_df['Total Deaths']/country_consolidated_df['Total Confirmed Cases'],3)

country_consolidated_df.head()

This part of the model estimates the possible number of cases from the death data. Because the reported number of cases only includes people who have been tested, it likely undercounts the true number of infections, so an estimate based on death data may give a more accurate picture of how widespread the virus is. The estimate is also used to help predict future case numbers and deaths.  
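
The logic of the estimate can be sketched on a toy series before it is run on the real data. The parameter values below (2 days to death, 1 day doubling time) are deliberately small, made-up numbers so that the shift is visible in six rows; they are not the values used in the model.

```python
import pandas as pd

# Toy daily deaths for one hypothetical place
dates = pd.date_range('2020-03-01', periods=6)
index = pd.MultiIndex.from_product([['A'], dates],
                                   names=['Country/Region', 'Date'])
daily_deaths = pd.Series([0, 0, 1, 2, 4, 8], index=index,
                         name='Daily New Deaths')

mortality_rate = 0.01  # fraction of infected people who die (assumed)
death_time = 2         # days from exposure to death (toy value)
doubling_time = 1      # days for cases to double (toy value)

# People infected alongside those who died on each date
infected = daily_deaths / mortality_rate

# Those infections happened death_time days earlier, so move them back
past_cases = infected.groupby(level='Country/Region').shift(-death_time)

# Grow the same number forward to the date of death
current_cases = infected * 2 ** (death_time / doubling_time)

print(pd.concat([past_cases, current_cases], axis=1,
                keys=['Estimated Past Cases', 'Estimated Current Cases']))
```

The last `death_time` rows of the past estimate are empty because the deaths that would inform them have not been reported yet.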

In [None]:
#Estimated parameters
#The incubation period (time from exposure to the development of symptoms) of the virus is estimated to be between 2 and 14 days based on the following sources:
#
#    The World Health Organization (WHO) reported an incubation period for COVID-19 between 2 and 10 days. [1]
#    China’s National Health Commission (NHC) had initially estimated an incubation period from 10 to 14 days [2].
#    The United States' CDC estimates the incubation period for COVID-19 to be between 2 and 14 days [3].
#    DXY.cn, a leading Chinese online community for physicians and health care professionals, is reporting an incubation period of "3 to 7 days, up to 14 days".
#
#The estimated range will be most likely narrowed down as more data becomes available.
days_to_symptoms = 6 #number of days until someone who was exposed shows symptoms

#Average person dies within 14 days after onset of symptoms
#https://onlinelibrary.wiley.com/doi/pdf/10.1002/jmv.25689
days_after_symptoms = 14 #number of days after someone who shows symptoms dies

#time to death is the time from exposure until symptoms first appear plus the time from 
#the onset of symptoms until death
death_time = days_to_symptoms + days_after_symptoms

#Current worldwide average is around 1%, however it was higher in China (3.4%) and other countries. 
#Since testing is still not widespread in many places it's hard to know what the true mortality rate is 
#at this point. 
mortality_rate = 0.01 #fraction of cases that result in death (0.01 = 1%)

#There are also various estimates for the doubling time. The data from the US and Italy indicates that the 
#doubling time is approximately 5 days. A shorter doubling time of 3 days is used for this run. 
doubling_time = 3 #days for cases to double

estimated_cases = estimate_current_cases(country_consolidated_df, days_to_symptoms, days_after_symptoms, mortality_rate, doubling_time, 'Country/Region')


In [None]:
#Add estimated cases to country_consolidated_df
country_consolidated_df = pd.merge(country_consolidated_df, estimated_cases, how='left', left_index=True, right_index=True)

pd.set_option('display.max_rows', 12000)
country_consolidated_df

In [None]:
#Country is the name of a country in the index of country_consolidated_df.
#country_consolidated_df contains the following columns: Total Confirmed Cases, Total Deaths, Death to Cases Ratio, 
#Estimated Current Cases, Estimated Past Cases

def plot_country_data(Country):
    fig = make_subplots(rows=4, cols=1,shared_xaxes=False, 
                    subplot_titles=('Total Confirmed Cases','Deaths', 'Death to Cases Ratio', 'Estimated Current Cases (Daily)'))
    
    fig.add_trace(go.Scatter(x=country_consolidated_df.loc[Country].index,y=country_consolidated_df.loc[Country, 'Total Confirmed Cases'],
                    name='Data',
                    line=dict(color='firebrick', width=3, dash='dot')),
                    row=1, col=1)
    fig.add_trace(go.Scatter(x=country_consolidated_df.loc[Country].index,y=country_consolidated_df.loc[Country,'Total Deaths'],
                    name='Data',
                    line=dict(color='firebrick', width=3, dash='dot')),
                    row=2, col=1)
    fig.add_trace(go.Scatter(x=country_consolidated_df.loc[Country].index,y=country_consolidated_df.loc[Country,'Death to Cases Ratio'],
                    mode='lines+markers',
                    name='Death to Cases Ratio',
                    line=dict(color='LightSkyBlue',width=2)),
                    row=3,col=1)
    fig.add_trace(go.Bar(x=country_consolidated_df.loc[Country].index,y=country_consolidated_df.loc[Country,'Estimated Current Cases'],
                    name='Estimated Current Cases (Daily)'),
                    row=4,col=1)
    fig.add_trace(go.Bar(x=country_consolidated_df.loc[Country].index,y=country_consolidated_df.loc[Country,'Estimated Past Cases'],
                    name='Estimated Past Cases (Daily)'),
                    row=4,col=1)
    
    fig.update_layout(height=1200, width=800,showlegend=False)
    return fig

In [None]:
#Plot country data.  

CountriesList=['US','Italy','Spain','Germany','Iran','China','France','Switzerland','United Kingdom','Netherlands','Austria','Belgium','Norway','Sweden','Korea, South']
interact(plot_country_data, Country=widgets.Dropdown(options=CountriesList))

In [None]:
mean_death_to_cases_ratio = country_consolidated_df.groupby('Country/Region', as_index=True)['Death to Cases Ratio'].agg(['mean'])
mean_death_to_cases_ratio = mean_death_to_cases_ratio.dropna()
mean_death_to_cases_ratio = mean_death_to_cases_ratio[mean_death_to_cases_ratio['mean']>0]
mean_death_to_cases_ratio = mean_death_to_cases_ratio[mean_death_to_cases_ratio['mean']<0.08]

In [None]:
#Table showing the mean death to cases ratio for each country
temp = mean_death_to_cases_ratio
temp_sorted = temp.sort_values(by=['mean'], ascending=False)
temp_sorted.style.background_gradient()

In [None]:
histogram_x = mean_death_to_cases_ratio['mean']
plt.hist(histogram_x, bins='auto')
plt.xlabel('Mortality Rate')
plt.ylabel('Frequency')
plt.show()

In [None]:
mean_death_to_cases_ratio.mean()

### Plot Data for States

In [None]:
#consolidate state data for timeseries plots

confirmed_state_df=timeseries_data(confirmed_date_column,'Cases','Total Confirmed Cases','Province/State')
deaths_state_df=timeseries_data(deaths_date_column,'Cases','Total Deaths','Province/State')
#recoveries_state_df=timeseries_data(recovered_date_column,'Cases','Total Recoveries','Province/State')

confirmed_state_df.head()

In [None]:
#get new daily data from cumulative sum
current_cases_state=daily_data(confirmed_state_df,'Total Confirmed Cases','Daily New Cases')
current_deaths_state=daily_data(deaths_state_df,'Total Deaths','Daily New Deaths')
#current_recoveries_state=daily_data(recoveries_state_df,'Total Recoveries','Daily New Recoveries')



In [None]:
confirmed_state_df

In [None]:
#combine all data sets. The recoveries data set is not loaded, so the lines that depend on it are commented out.
state_consolidated_df=pd.merge(confirmed_state_df,deaths_state_df,how='left',left_index=True,right_index=True)
#state_consolidated_df=pd.merge(state_consolidated_df,recoveries_state_df,how='left',left_index=True,right_index=True)
state_consolidated_df=pd.merge(state_consolidated_df,current_cases_state,how='left',left_index=True,right_index=True)
state_consolidated_df=pd.merge(state_consolidated_df,current_deaths_state,how='left',left_index=True,right_index=True)
#state_consolidated_df=pd.merge(state_consolidated_df,current_recoveries_state,how='left',left_index=True,right_index=True)
#state_consolidated_df['Active Cases']=state_consolidated_df['Total Confirmed Cases']-state_consolidated_df['Total Deaths']-state_consolidated_df['Total Recoveries']
#state_consolidated_df['Share of Recoveries - Closed Cases']=np.round(state_consolidated_df['Total Recoveries']/(state_consolidated_df['Total Recoveries']+state_consolidated_df['Total Deaths']),2)
state_consolidated_df['Death to Cases Ratio']=np.round(state_consolidated_df['Total Deaths']/state_consolidated_df['Total Confirmed Cases'],3)


In [None]:


#use the same function and parameters as for the country data
estimated_cases = estimate_current_cases(state_consolidated_df, days_to_symptoms, days_after_symptoms, mortality_rate, doubling_time, 'Province/State')


In [None]:
#Add estimated cases to state_consolidated_df
state_consolidated_df = pd.merge(state_consolidated_df, estimated_cases, how='left', left_index=True, right_index=True)
state_consolidated_df


In [None]:
def plot_state_data(State):
    fig = make_subplots(rows=5, cols=1,shared_xaxes=False, 
                    subplot_titles=('Total Confirmed Cases','Deaths', 'Death to Cases Ratio', 'Estimated Current Cases (Daily)', 'Cumulative Sum Estimated Cases'))
    
    fig.add_trace(go.Scatter(x=state_consolidated_df.loc[State].index,y=state_consolidated_df.loc[State, 'Total Confirmed Cases'],
                    name='Data',
                    line=dict(color='firebrick', width=3, dash='dot')),
                    row=1, col=1)
    fig.add_trace(go.Scatter(x=state_consolidated_df.loc[State].index,y=state_consolidated_df.loc[State,'Total Deaths'],
                    name='Data',
                    line=dict(color='firebrick', width=3, dash='dot')),
                    row=2, col=1)
    fig.add_trace(go.Scatter(x=state_consolidated_df.loc[State].index,y=state_consolidated_df.loc[State,'Death to Cases Ratio'],
                         mode='lines+markers',
                         name='Death to Cases Ratio',
                         line=dict(color='LightSkyBlue',width=2)),
                         row=3,col=1)
    fig.add_trace(go.Bar(x=state_consolidated_df.loc[State].index,y=state_consolidated_df.loc[State,'Estimated Current Cases'],
                         name='Estimated Current Cases (Daily)'),
                         row=4,col=1)
    fig.add_trace(go.Bar(x=state_consolidated_df.loc[State].index,y=state_consolidated_df.loc[State,'Estimated Past Cases'],
                         name='Estimated Past Cases (Daily)'),
                         row=4,col=1)
    fig.add_trace(go.Scatter(x=state_consolidated_df.loc[State].index,y=state_consolidated_df.loc[State,'Estimated Current Cases'].cumsum(),
                         mode='lines',
                         name='Cumulative Sum Estimated Cases',
                         line=dict(color='firebrick',width=2)),
                         row=5,col=1)
    
    fig.update_layout(height=1200, width=800,showlegend=False)
    return fig


In [None]:


StateList=['California','Oregon','New York','Texas','Florida','Utah','Wisconsin','Washington']
interact(plot_state_data, State=widgets.Dropdown(options=StateList))

In [None]:
mean_death_to_cases_ratio = state_consolidated_df.groupby('Province/State', as_index=True)['Death to Cases Ratio'].agg(['mean'])


In [None]:
mean_death_to_cases_ratio = mean_death_to_cases_ratio.dropna()


In [None]:
mean_death_to_cases_ratio = mean_death_to_cases_ratio[mean_death_to_cases_ratio['mean']>0]
mean_death_to_cases_ratio = mean_death_to_cases_ratio[mean_death_to_cases_ratio['mean']<0.06]

In [None]:
#Table showing mortality rate for each state or province
temp = mean_death_to_cases_ratio
temp_sorted = temp.sort_values(by=['mean'], ascending=False)
temp_sorted.style.background_gradient()

In [None]:
histogram_x = mean_death_to_cases_ratio['mean']
plt.hist(histogram_x, bins='auto')

In [None]:
mean_death_to_cases_ratio.mean()

# Predict

I'm using the following mathematical model for predicting possible outcomes: 

$\frac{dC_A}{dt} = k C_A^{\alpha}$

$C_A$ is the concentration of people who are infected, people who have died or people who have recovered, depending on the data. $k$ is the rate constant.  If it is positive, the concentration grows with time. $\alpha$ is the order and can be used to estimate the growth rate. 

This equation can be integrated to get: 

$C_A = ( C_{A0}^{1-\alpha}+(1-\alpha) k t)^{\frac{1}{1-\alpha}}$

where $C_{A0}$ is the value of $C_A$ at $t = 0$ and $\alpha \neq 1$. Defining $m=\frac{1}{1-\alpha}$

The equation can be rewritten as: 

$C_A = (C_{A0}^{1/m} + \frac{k}{m} t)^{m}$

Once k and m are known from the regression, $\alpha$ can be calculated from m as: 

$\alpha = 1 - \frac{1}{m}$
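
As a sanity check on the integration, the closed form can be compared against a direct numerical solution of the rate equation. The sketch below uses a hand-written fourth-order Runge-Kutta integrator; the parameter values (initial value 1, $k = 0.5$, $\alpha = 0.5$) are arbitrary illustrative choices and the function names are hypothetical.

```python
# Compare the closed-form solution with a numerical integration of
# dC/dt = k * C**alpha. Parameter values are arbitrary test inputs.

def closed_form(t, c0, k, alpha):
    m = 1.0 / (1.0 - alpha)
    return (c0 ** (1.0 / m) + (k / m) * t) ** m

def integrate_rk4(c0, k, alpha, t_end, steps=1000):
    # Classic fourth-order Runge-Kutta with a fixed step size
    def rate(c):
        return k * c ** alpha
    h = t_end / steps
    c = c0
    for _ in range(steps):
        k1 = rate(c)
        k2 = rate(c + 0.5 * h * k1)
        k3 = rate(c + 0.5 * h * k2)
        k4 = rate(c + h * k3)
        c += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return c

analytic = closed_form(10.0, c0=1.0, k=0.5, alpha=0.5)
numeric = integrate_rk4(c0=1.0, k=0.5, alpha=0.5, t_end=10.0)
print(analytic, numeric)
```

For these inputs $m = 2$, so the closed form reduces to $(1 + 0.25\,t)^2$, and the two results should agree closely.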

For future analysis, I'm planning to expand the model to take into account controls such as social distancing, handwashing, etc. This can be taken into account by adjusting the rate constant with time. 
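
One possible way to let the rate constant change with time is sketched below. This is not the planned implementation: the step change in $k$ on a single "control day", the values of $k$ before and after, and the simple Euler integration are all assumptions made for illustration.

```python
# Hypothetical sketch: controls are represented as a drop in the rate
# constant k on a given day. All numbers are illustrative assumptions.

def simulate(c0, alpha, k_of_t, days, dt=0.01):
    # Forward Euler integration of dC/dt = k(t) * C**alpha
    c = c0
    t = 0.0
    for _ in range(int(round(days / dt))):
        c += dt * k_of_t(t) * c ** alpha
        t += dt
    return c

def k_with_controls(t, k_before=0.5, k_after=0.2, control_day=10.0):
    # Rate constant drops once controls start
    return k_before if t < control_day else k_after

no_controls = simulate(1.0, 0.5, lambda t: 0.5, days=20)
with_controls = simulate(1.0, 0.5, k_with_controls, days=20)
print(no_controls, with_controls)
```

A smoother function of time could be substituted for the step change without altering the rest of the sketch.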

In [None]:
#Function for the regression model
def concentration_equation(time, m, k):
    C_ao=0
    return ( C_ao**(1/m) + (1/m)*k*time)**(m)

#Regression model. place is either a country or a state, for example "US" or "New York".
#group is either 'Country/Region' or 'Province/State'. 
#col_name is the column that the model is run on, for instance "Total Confirmed Cases" 
#or "Total Deaths". df is the dataframe, either the country or the state dataframe. 
#starting_date is a number representing the day at which the regression should start. For example, 
#if cases didn't really start occurring until day 30 in a specific area, then this number would be 30. The model
#will cut the dataframe so that it only includes dates from day 30 onward. 
def regression_model(place, group, col_name, df, starting_date):
    
    #arrange the data into either country or state groups
    group= df.groupby(group)
    group = group.get_group(place)

    #cut the data up to the date that you want to start the regression model. 
    group= group.iloc[starting_date:]

    ydata=group.loc[place, col_name]

    time_data = (group.loc[place].index - group.loc[place].index[0]).days
    
    #convert the data to 1D float arrays for curve_fit
    ydata = np.asarray(ydata, dtype=float)
    time_data = np.asarray(time_data, dtype=float)
    
    popt, pcov = curve_fit(concentration_equation, time_data, ydata, maxfev=5000)
    
    plt.plot(time_data, ydata, 'o', label='data')
    plt.plot(time_data, concentration_equation(time_data, *popt), 'r-', label='fit: m=%5.3f, k=%5.3f' % tuple(popt))
    plt.xlabel('Number of Days')
    plt.ylabel(col_name)
    plt.title(place)
    
    print(popt)
    
    return popt
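
Because `C_ao` is fixed at 0 in `concentration_equation`, the model reduces to a power law, $C_A = (\frac{k}{m} t)^m$. As a cross-check on the nonlinear fit, a power law can also be fitted by ordinary linear regression in log-log space. Note that this is a different technique from the `curve_fit` call used above, shown here on synthetic data generated from known parameters rather than on the real data set.

```python
import numpy as np

def power_law(time, m, k):
    # Same form as concentration_equation with C_ao = 0
    return ((k / m) * time) ** m

# Synthetic data generated from known parameters (not real data)
true_m, true_k = 3.0, 2.0
t = np.arange(1, 21, dtype=float)  # start at day 1 because log(0) is undefined
y = power_law(t, true_m, true_k)

# log(y) = m*log(t) + m*log(k/m), which is a straight line in log-log space
slope, intercept = np.polyfit(np.log(t), np.log(y), 1)
m_fit = slope
k_fit = m_fit * np.exp(intercept / m_fit)
alpha_fit = 1 - 1 / m_fit

print(m_fit, k_fit, alpha_fit)
```

On noisy data the two approaches weight errors differently, so the fitted parameters will generally not match exactly.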

    

### US Model and Prediction

In [None]:
us_consolidated_cases_df = country_consolidated_df.copy()

parameters = regression_model('US','Country/Region','Total Confirmed Cases', us_consolidated_cases_df, 30)


In [None]:
#Predict US cases at current rate of increase
C_ao=0


#days_elapsed = (date.today() - us_consolidated_cases_df.loc['US'].index[0]).days


x_time_predict = np.arange(1, 58,1)
y_predict_US = concentration_equation(x_time_predict, parameters[0], parameters[1])

plt.plot(x_time_predict, y_predict_US, label='data')
plt.xlabel('Number of Days')
plt.ylabel('Predicted Cases Over Four Weeks')
plt.title('US- Predicted Cases')

In [None]:
us_consolidated_deaths_df = country_consolidated_df.copy()


parameters = regression_model('US','Country/Region','Total Deaths', us_consolidated_deaths_df, 30)

In [None]:
#Predict US deaths at current rate of increase
C_ao=0

x_time_predict = np.arange(1, 58,1)
y_predict_US = concentration_equation(x_time_predict, parameters[0], parameters[1])

plt.plot(x_time_predict, y_predict_US, label='data')
plt.xlabel('Number of Days')
plt.ylabel('Predicted Deaths Over Four Weeks')
plt.title('US- Predicted Deaths')

### Italy Model and Predictions

### New York Model and Prediction

### California Model and Prediction