
# **How deadly is COVID-19?**

Notebook by Julia, Laine, Aaron, Demitri, Chloe, and Jamie

# **Introduction**


**Author:** Julia

**Date last edited:** November 17th 2020

Our aim is to uncover the deadliness of COVID-19, and to communicate uncertainty in our analysis. Communicating uncertainty improves stakeholders’ understanding of the deadliness of COVID-19, and informs decisions about how best to respond to COVID-19. COVID-19 affects countries all over the world. Governments need to be informed to make the right health and economic decisions, and health systems and health workers need to be prepared to deliver effective care. The livelihoods and health of residents, COVID-19 patients, and their families are at stake. Every person has the right to accurate information, and the media and government agencies have the responsibility of communicating this information. The data needed for our analysis includes COVID-19 cases, testing, deaths and populations for 6 countries: Australia, Canada, India, Singapore, the United Kingdom, and the United States. This data is needed to understand and compare the deadliness of COVID-19 and its uncertainty between countries.


# **Code**

**Author:** Julia

**Latest date edited:** 15 November 2020

We imported all libraries required for analysis of COVID-19 deadliness and uncertainty in the data at the beginning of our Notebook. 

In [None]:
import pandas as pd
import numpy as np
import io
import matplotlib.pyplot as plt
from typing import List, Dict
!pip install sigfig
import sigfig
import scipy
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
import scipy.stats as sp
import random
# import the module (not the class): the code below uses datetime.date and datetime.timedelta
import datetime
import plotly.express as px
import warnings
import sys
from google.colab import files
%pip install -U kaleido
import kaleido
import calendar

# from: https://plotly.com/python/orca-management/
# quote the requirement so the shell does not treat ">" as an output redirect
!pip install "plotly>=4.7.1"
!wget https://github.com/plotly/orca/releases/download/v1.2.1/orca-1.2.1-x86_64.AppImage -O /usr/local/bin/orca
!chmod +x /usr/local/bin/orca
!apt-get install xvfb libgtk2.0-0 libgconf-2-4

Requirement already up-to-date: kaleido in /usr/local/lib/python3.6/dist-packages (0.0.3.post1)

**Author:** Julia

**Latest date edited:** November 15, 2020

Reporting the libraries used in the analysis supports transparency, explainability, and accountability, and therefore contributes to the reproducibility of the analysis and results.

**Author:** Laine, Julia

**Latest date edited:** 15 November 2020

Similarly, we have reported the versions of each library below.

In [None]:
%pip freeze 

absl-py==0.10.0
alabaster==0.7.12
albumentations==0.1.12
altair==4.1.0
argon2-cffi==20.1.0
asgiref==3.3.0
astor==0.8.1
astropy==4.1
astunparse==1.6.3
async-generator==1.10
atari-py==0.2.6
atomicwrites==1.4.0
attrs==20.2.0
audioread==2.1.9
autograd==1.3
Babel==2.8.0
backcall==0.2.0
beautifulsoup4==4.6.3
bleach==3.2.1
blis==0.4.1
bokeh==2.1.1
Bottleneck==1.3.2
branca==0.4.1
bs4==0.0.1
CacheControl==0.12.6
cachetools==4.1.1
catalogue==1.0.0
certifi==2020.6.20
cffi==1.14.3
chainer==7.4.0
chardet==3.0.4
click==7.1.2
cloudpickle==1.3.0
cmake==3.12.0
cmdstanpy==0.9.5
colorlover==0.3.0
community==1.0.0b1
contextlib2==0.5.5
convertdate==2.2.2
coverage==3.7.1
coveralls==0.5
crcmod==1.7
cufflinks==0.17.3
cvxopt==1.2.5
cvxpy==1.0.31
cycler==0.10.0
cymem==2.0.4
Cython==0.29.21
daft==0.0.4
dask==2.12.0
dataclasses==0.7
datascience==0.10.6
debugpy==1.0.0
decorator==4.4.2
defusedxml==0.6.0
descartes==1.1.0
dill==0.3.3
distributed==1.25.3
Django==3.1.3
dlib==19.18.0
dm-tree==0.1.5
docopt==0.6.2
docutil

**Author:** Julia

**Latest date edited:** 15 November 2020

We have included the version of each library so that the code runs the same way it did in our analysis if someone else reproduces it in the future. Both the libraries and their versions are required to reproduce our analysis of the deadliness of COVID-19 and its uncertainty.
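As a complement to the full `%pip freeze` listing, the versions of only the libraries that matter for the analysis can be recorded programmatically. The following is a minimal sketch, assuming Python 3.8 or later (where `importlib.metadata` is part of the standard library); the helper name `get_versions` and the package list are illustrative and not part of our notebook.

```python
from importlib import metadata


def get_versions(packages):
    """Return a dict mapping each package name to its installed version (None if missing)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # the package is not installed in this environment
            versions[name] = None
    return versions


# hypothetical package list; the last name is deliberately not a real package
report = get_versions(["pandas", "numpy", "a-package-that-is-not-installed"])
for name, version in report.items():
    print(name, version)
```

A short listing like this is easier to read than the full environment dump, although the full dump remains the more complete record.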

## **Function development**

For all function development:

**Author:** Laine
 
**Created on:** October 20th, 2020

**Last edited:** November 18th, 2020

### Test dataset generation

The test dataset will have data for the following columns:

1. country
2. time
3. new_deaths: number of new deaths per day
4. new_tests: number of new tests conducted per day
5. new_cases: number of new cases reported per day

The variables 'new_deaths', 'new_tests', and 'new_cases' were generated from a Poisson distribution. A Poisson distribution was selected because we observed in our exploratory data analysis that all these variables follow a Poisson distribution. Lambda was selected according to the minimum and maximum values recorded for each variable. NaN values were incorporated for approximately 10% of the values of 'new_deaths', 'new_tests', and 'new_cases'. The dates were generated randomly from a range spanning December 1st, 2019 (December being when the first COVID-19 case was reported) to November 4th, 2020.
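One quick way to sanity-check whether counts look Poisson-like is the dispersion index (variance divided by mean), which is close to 1 for Poisson data. The sketch below is illustrative only: it uses simulated values rather than our data, the helper name `dispersion_index` is hypothetical, and the 10% NaN masking simply mirrors the description above.

```python
import numpy as np


def dispersion_index(values):
    """Variance-to-mean ratio, ignoring NaN values; close to 1 for Poisson-like counts."""
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)
    if mean == 0:
        return float("nan")
    return np.nanvar(values, ddof=1) / mean


rng = np.random.default_rng(0)

# simulate daily counts with lambda = 5, then mask roughly 10% of them as NaN
sample = rng.poisson(lam=5, size=10_000).astype(float)
sample[rng.random(sample.size) < 0.1] = np.nan

print(round(dispersion_index(sample), 2))
```

A ratio well above 1 on real data would suggest overdispersion, in which case a Poisson model would understate the variability.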

First, we define some helper functions to generate the test dataset, then we create it. 

In [None]:
# helper functions to create the artificial dataset
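def generate_poisson_distribution(n_distributions, distribution_size, lam_list):
  """Returns one array of n_distributions Poisson samples, each of length distribution_size."""

  arr = np.zeros(n_distributions*distribution_size)

  counter = 0

  for i in range(0, n_distributions):
      # draw one sample per lambda and write it into its slice of the array
      dat = np.random.poisson(lam=lam_list[i], size=distribution_size)
      arr[counter:counter+distribution_size] = dat
      counter += distribution_size
  return arr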
  

# adapted from https://www.kite.com/python/answers/how-to-generate-a-random-date-between-two-dates-in-python
start_date = datetime.date(2019, 12, 1)
end_date = datetime.date(2020, 11, 4)

def generate_random_date(start_date, end_date):

  time_between_dates = end_date - start_date
  days_between_dates = time_between_dates.days
  random_number_of_days = random.randrange(days_between_dates)
  random_date = start_date + datetime.timedelta(days=random_number_of_days)

  return random_date

Now we create the test dataset using these functions. 

In [None]:
# Generate the artificial dataset

size = 20
percent_null = 0.1
n_distributions = 5

# generate list of random dates
date = [0]*(size*n_distributions)

for i in range(0, size*n_distributions):
  date[i] = generate_random_date(start_date, end_date)

date = np.array(date)

# generate list of country names (lower case with underscores, matching the lookups in the functions below)
country = ['canada']*size + ['australia']*size + ['united_states']*size + ['united_kingdom']*size + ['singapore']*size
country = np.array(country)

# list of reasonable lambda values for each variable
lambda_deaths_per_day = random.sample(range(0, 10), 5)
lambda_tests_per_day = random.sample(range(250, 2000), 5)
lambda_cases_per_day = random.sample(range(10, 6000), 5)


# new deaths
new_deaths = generate_poisson_distribution(n_distributions, size, lambda_deaths_per_day)

# new cases 
new_cases = generate_poisson_distribution(n_distributions, size, lambda_cases_per_day)

# new tests
new_tests = generate_poisson_distribution(n_distributions, size, lambda_tests_per_day)

# combine data into one dataframe 
index = np.array(list(range(0,size*n_distributions)))
test_data= pd.DataFrame({'country': country, 'time': date, 'new_deaths': new_deaths, 'new_cases': new_cases, 'new_tests': new_tests})
# mask roughly percent_null of the values as NaN
nan_mat = np.random.random(test_data.loc[:, ['new_deaths', 'new_cases', 'new_tests']].shape) < percent_null

test_data.loc[:, ['new_deaths', 'new_cases', 'new_tests']]= test_data.loc[:, ['new_deaths', 'new_cases', 'new_tests']].mask(nan_mat)

test_data['time'] = pd.to_datetime(test_data['time'], dayfirst = False)
test_data['month'] = pd.DatetimeIndex(test_data['time']).month
test_data['case_fatality_rate'] = test_data['new_deaths']/test_data['new_cases']


test_data.set_index(keys= 'country', inplace=True)

test_data.rename(columns = {'new_deaths': 'confirmed_deaths', 'new_tests': 'tests', 'new_cases': 'confirmed_cases'}, inplace=True)

test_data.head()

| country | time | confirmed_deaths | confirmed_cases | tests | month | case_fatality_rate |
| --- | --- | --- | --- | --- | --- | --- |
| canada | 2020-01-09 | 3.0 | 5753.0 | 1708.0 | 1 | 0.000521 |
| canada | 2020-05-31 | 4.0 | 5826.0 | 1660.0 | 5 | 0.000687 |
| canada | 2020-04-10 | 3.0 | 5914.0 | 1704.0 | 4 | 0.000507 |
| canada | 2020-06-30 | 1.0 | 5744.0 | 1754.0 | 6 | 0.000174 |
| canada | 2020-03-21 | 1.0 | 5729.0 | 1655.0 | 3 | 0.000175 |


Now that we have generated the test dataset, we test individual functions to ensure they are working properly. 
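Because the test dataset is drawn at random, it differs on every run. If repeatable test output is wanted, one option is to seed both random number generators before generating the data. This is a minimal sketch of the idea, not part of our notebook; the helper name `draw` and the seed value are arbitrary.

```python
import random

import numpy as np


def draw(seed):
    """Seed both generators, then draw lambdas and one Poisson sample."""
    random.seed(seed)      # controls random.sample and random.randrange
    np.random.seed(seed)   # controls np.random.poisson and np.random.random
    lambdas = random.sample(range(1, 10), 3)
    counts = np.random.poisson(lam=lambdas[0], size=5).tolist()
    return lambdas, counts


first = draw(42)
second = draw(42)
print(first == second)
```

With a fixed seed, any change in a function's output can be attributed to the function rather than to a different random draw.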

### **Numerical Summaries**

We start with numerical summaries. This first function is intended to generate monthly totals for a given variable. 

In [None]:
def get_monthly_stat(data, country, variable_name):
  """Returns monthly totals for a given variable."""

  # convert country to lowercase
  country = country.lower().replace(" ", "_")
  
  # get the index values
  multi_index_vals =  list(data.index.get_level_values(level=0).unique())

  # check the time variable is in datetime format
  assert np.issubdtype(data.loc[:, 'time'], np.datetime64), '"time" column dtype should be datetime64[ns]'

  # check the country is in the multi-index value list
  assert country in multi_index_vals, "{country} not in the index.".format(country=country)

  # check the input variable is in the columns
  assert variable_name in data.columns, "Input variable not found in dataframe columns. "

  # add a month column to the data
  data['month'] = pd.DatetimeIndex(data['time']).month

  if 'rate' in variable_name:
    monthly_stat_df =  round(data.loc[country][['month', variable_name]].groupby('month').mean(), 1)
  else:
    monthly_stat_df =  data.loc[country][['month', variable_name]].groupby('month').sum().astype('int')

  return monthly_stat_df

get_monthly_stat(test_data, 'Canada', 'case_fatality_rate')

| month | case_fatality_rate |
| --- | --- |
| 1 | 0.0 |
| 3 | 0.0 |
| 4 | 0.0 |
| 5 | 0.0 |
| 6 | 0.0 |
| 7 | 0.0 |
| 8 | 0.0 |
| 9 | 0.0 |
| 10 | 0.0 |
| 12 | 0.0 |


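This function appears to be working as expected. We have a column showing each month, and a column showing the monthly average case fatality rate, rounded to one decimal place. Because the rates in the test data are on the order of 0.0005, they all round to 0.0. For count variables such as confirmed deaths, the function instead returns monthly totals as integers, which is what we would expect for count data.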
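The two branches of the function (totals for counts, means for rates) can be illustrated on a tiny hand-made dataset where the expected answers are easy to verify by eye. This is a sketch with made-up numbers, separate from the functions above.

```python
import pandas as pd

toy = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-01", "2020-02-15"]),
    "confirmed_deaths": [1.0, 2.0, None, 4.0],
    "case_fatality_rate": [0.10, 0.30, 0.20, 0.40],
})
toy["month"] = toy["time"].dt.month

# counts: monthly totals as integers (missing values are skipped by sum)
monthly_totals = toy.groupby("month")["confirmed_deaths"].sum().astype(int)

# rates: monthly means rounded to one decimal place
monthly_means = toy.groupby("month")["case_fatality_rate"].mean().round(1)

print(monthly_totals)
print(monthly_means)
```

Checking against known answers like this is a stricter test than inspecting output from random data.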

Next, we look at a second numerical function which gets all monthly statistics for a given variable and country. 

In [None]:
def get_all_monthly_stats(data, country, variable_name):

  # convert country to lowercase
  country = country.lower().replace(" ", "_")
  
  # get the index values
  multi_index_vals =  list(data.index.get_level_values(level=0).unique())

   # check the time variable is in datetime format
  assert np.issubdtype(data.loc[:, 'time'], np.datetime64), '"time" column dtype should be datetime64[ns]'

  # check the country is in the multi-index value list
  assert country in multi_index_vals, "{country} not in the index.".format(country=country)

  # check the input variable is in the columns
  assert variable_name in data.columns, "Input variable not found in dataframe columns. "

  # get vars from df for selected country
  data_subset = data.loc[country][['time', variable_name]]

  # make time the index
  data_subset.index = data_subset['time']
  data_subset.drop('time', axis=1, inplace=True)

  # calculate the mean, sd, n, median, max, and min for each month
  # (max_val and min_val avoid shadowing the built-in max and min)
  mean = data_subset.astype('float').resample('M').mean().T
  sd = data_subset.astype('float').resample('M').std().T
  n = data_subset.astype('float').resample('M').count().astype(int).T
  med = data_subset.astype('float').resample('M').median().T
  max_val = data_subset.astype('float').resample('M').max().T
  min_val = data_subset.astype('float').resample('M').min().T
  
  # put all the stats together
  monthly_stats_list = pd.concat([n,
                       max_val,
                       min_val, 
                       med,
                       mean, 
                       sd ],
                       keys = [
                              'count',
                              'max',
                              'min', 
                              'median', 
                              'mean', 
                              'standard deviation'
                               ]
                      ).reset_index()

  # replace index with multi-level index
  multi_idx = pd.MultiIndex.from_frame(monthly_stats_list.loc[:, ['level_0']], names = ["stat"])
  monthly_stats_list= monthly_stats_list.set_index(multi_idx).drop(['level_0', 'level_1'], axis=1)
  monthly_stats_list.sort_values(['stat'], inplace=True)

  # change the row-level axis to show month names
  new_axis = pd.Series(pd.DatetimeIndex(monthly_stats_list.columns).month).apply(lambda x: calendar.month_abbr[x])

  # set this as the row-level axis
  monthly_stats_list.columns = new_axis

  monthly_stats_list.fillna("Not enough data", inplace = True) 

  # round to nearest decimal place
  monthly_stats_list = round(monthly_stats_list, 1)

  return monthly_stats_list

monthly_stats = get_all_monthly_stats(test_data, 'Canada', 'confirmed_deaths')
monthly_stats


| stat | Dec | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1 | 3.0 | 0 | 1 | 3.0 | 1 | 3.0 | 1 | 2.0 | 1 | 1 |
| max | 3 | 4.0 | Not enough data | 1 | 3.0 | 4 | 2.0 | 3 | 3.0 | 3 | 6 |
| mean | 3 | 3.3 | Not enough data | 1 | 2.0 | 4 | 1.7 | 3 | 2.0 | 3 | 6 |
| median | 3 | 3.0 | Not enough data | 1 | 2.0 | 4 | 2.0 | 3 | 2.0 | 3 | 6 |
| min | 3 | 3.0 | Not enough data | 1 | 1.0 | 4 | 1.0 | 3 | 1.0 | 3 | 6 |
| standard deviation | Not enough data | 0.6 | Not enough data | Not enough data | 1.0 | Not enough data | 0.6 | Not enough data | 1.4 | Not enough data | Not enough data |


This function appears to be working as expected. Each column is a month, and each row is a summary statistic. Values are rounded to one decimal place, and counts are whole numbers. Where a statistic cannot be calculated (a null value), the function outputs 'Not enough data'.
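The 'Not enough data' entries for the standard deviation arise because the sample standard deviation is undefined for a month with a single observation. A minimal sketch on made-up numbers, grouping by calendar month with `to_period` rather than `resample` (an assumption made here to keep the example short; unlike `resample`, it does not create rows for months with no data):

```python
import math

import pandas as pd

deaths = pd.Series(
    [3.0, 4.0, 3.0, 1.0],
    index=pd.to_datetime(["2020-01-03", "2020-01-10", "2020-01-25", "2020-03-02"]),
)

# group by calendar month and compute a few summary statistics
summary = deaths.groupby(deaths.index.to_period("M")).agg(["count", "mean", "std"])

# replace undefined statistics with a readable label for display
display_table = summary.astype(object).where(summary.notna(), "Not enough data")

print(display_table)
```

January has three observations and therefore a standard deviation; March has one, so its standard deviation is NaN before relabelling.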

### **Graphical Summaries**

Next, we create some functions that generate graphical summaries. These will include a table, one-variable graphs and tables, two-variable graphs and tables, and a graph that plots one variable grouped by country.

First, we create a function that builds a table for a specific variable. We pass in header values, cell values, and an optional title; the number of columns is taken from the header values. We will use this function together with the monthly stats function, so we test the two in combination.

In [None]:
def create_table(header_vals, cell_vals, title = None, ncols = None):
  """Creates a plotly table and returns both the figure and the table trace."""
  # use one column per header value unless a column count is given
  if ncols is None:
    ncols = len(header_vals)

  table = go.Table(
    columnorder = list(range(1, ncols+1)),
    # the first column is twice as wide as the others
    columnwidth = [20] + [10]*(ncols-1),
    header=dict(values=header_vals,
                fill_color='white',
                line_color = 'black',
                font = dict(size = 14),
                align='left'),
    cells=dict(values=cell_vals,
               fill_color='white',
               line_color = 'lightgrey',
               font = dict(size = 14),
               align='left'))

  fig = go.Figure(data=[table])

  fig.update_layout(titlefont = {'size': 28}, 
                    title = title)

  return fig, table 

monthly_stats = get_all_monthly_stats(test_data, 'Canada', 'confirmed_deaths')
header_vals = list(monthly_stats.columns)
cell_vals_list = [monthly_stats[k].tolist() for k in monthly_stats.columns[0:]]

fig, table = create_table(header_vals, cell_vals_list)


The table outputs what we would expect. We can now incorporate it into a figure with multiple subplots. 
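The table function expects one list per column, so a dataframe has to be converted column by column, with an extra leading column if the row labels should be shown. A minimal sketch of that conversion on a made-up statistics table (pandas only; the values are illustrative):

```python
import pandas as pd

stats = pd.DataFrame(
    {"Jan": [3, 3.3], "Feb": [1, 2.0]},
    index=["count", "mean"],
)

# an empty header above the row labels, then one header per month
header_vals = [""] + list(stats.columns)

# first the row labels, then one list of values per month column
cell_vals = [list(stats.index)] + [stats[col].tolist() for col in stats.columns]

print(header_vals)
print(cell_vals)
```

The header list and the cell list must have the same length, otherwise columns will be left without a header or without values.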

We now create a one-variable function. We expect this function to produce two subplots: one that charts a variable over time, and another that shows the summary statistics for that variable for each month. We make use of the numerical functions above. 

In [None]:
# define the colours for each variable
var_colour_dict = {'confirmed_deaths': 'black' ,
                 'confirmed_cases': 'rgb(206, 0, 206)',
                 'tests': 'rgb(0, 206, 206)',
                 'mortality_rate': '#FFA15A',
                 'case_fatality_rate': 'rgb(227, 66, 52)'}

# choose a colour blind safe palette for country plot
colour_blind_colours = px.colors.qualitative.Safe

country_colour_dict = {'australia': 'rgb(230, 159, 0)' ,
                       'canada': 'rgb(86, 180, 233)',
                       'india': 'rgb(0, 158, 115)',
                       'singapore': 'rgb(204, 121, 167)',
                       'united_kingdom': 'rgb(0, 0, 0)',  # assumed black; the original value 'rgb(0, 0,)' was incomplete
                       'united_states': 'rgb(240, 228, 66)'}


def analyse_one_var(data, Country, variable_name, n, nbins = 10, colour_dict = var_colour_dict):
  """Creates a figure containing a line graph of daily values and a table showing monthly summary statistics. 
  
  Args:
    data: the dataframe containing the data to graph and tabulate
    Country: the name of the country that should be explored (can be upper case)
    variable_name: the name of the y-variable that should be the focus of this figure
    n: the figure number that should be included in the title and subplot titles
    nbins: unused (this was an input when we had the bar chart, which has been excluded; kept in case it is re-introduced later)
    colour_dict: the colour dictionary to use for the plot
     """
  # convert country to lowercase
  country = Country.lower().replace(" ", "_")
  
  # get the index values
  multi_index_vals =  list(data.index.get_level_values(level=0).unique())

   # check the time variable is in datetime format
  assert np.issubdtype(data.loc[:, 'time'], np.datetime64), '"time" column dtype should be datetime64[ns]'

  # check the country is in the multi-index value list
  assert country in multi_index_vals, "{country} not in the index.".format(country=country)

  # check the input variable is in the columns
  assert variable_name in data.columns, "Input variable not found in dataframe columns. "

  
  # get the monthly totals for the histogram 
  monthly_totals = get_monthly_stat(data, country, variable_name)

  # get monthly summary stats for the table 
  monthly_stats = get_all_monthly_stats(data, country, variable_name)

  #font_color=['black']*2+[['red' if  val else 'black' for val in df[k].max()] for k in range(monthly_stats.shape[1])]
  font_colour = 'black'
  
  # get the colour for the selected country
  colour = colour_dict[variable_name]

  # assign values to x and y variables
  x = data.loc[country]['time']
  y = data.loc[country][variable_name]

  # assign subplot titles
  subplot_titles = ("Figure {n}.1. Daily {var} over Time".format(var = variable_name.replace("_", " ").title(), n=n),
                    "Figure {n}.2. COVID-19 {var}: Monthly Summary Statistics".format(var=variable_name.replace("_", " ").title(), n=n)) 


  header_vals = [""] + list(monthly_stats.columns)
  cell_vals_list = [[ 'Number of Days with Data',
                              'Max',
                              'Min', 
                              'Median', 
                              'Mean', 
                              'Standard Deviation'
                               ]] + [monthly_stats[k].tolist() for k in monthly_stats.columns[0:]]
  _, tbl = create_table(header_vals, cell_vals_list)

  # make the figure
  fig = make_subplots(
    rows=2, cols=2,
    specs=[[{"type": "scatter", "colspan": 2}, None],
           [{"colspan": 2,  "type": "table"}, None]],
    subplot_titles= subplot_titles, 
    vertical_spacing = 0.15, 
    horizontal_spacing = 0.15)

  # add a table
  fig.add_trace(
     tbl,  
      row = 2, col=1
  )

  # add a line plot
  fig.add_trace(go.Scatter(
      x=x,
      y=y,
      line=dict(color=colour, width=2)
  ), row = 1, col=1)


  
  # update the figure layout
  fig.update_layout(
      title="Figure {n}. COVID-19 {var} in {country}".format(var = variable_name.replace("_", " ").title(), country = country.replace('_', ' ').title(), n=n),
      autosize=True,
      width=1300,
      height=900,
      template = "plotly_white",
      titlefont = {'size': 30},
      showlegend=False
      )
  
  # update line chart axes
  fig.update_xaxes(title_text = "Time", row=1, col=1)
  fig.update_yaxes(title_text="Daily {var}".format(var = variable_name.replace("_", " ")), row=1, col=1)

  # label the y-axis as a percentage for the case fatality rate
  # (the axis updates for the former histogram at row 1, col 2 were dropped because that subplot no longer exists)
  if 'rate' in variable_name and 'case' in variable_name:
    fig.update_yaxes(title_text="Daily {var} (%)".format(var = variable_name.replace("_", " ").title()), row=1, col=1)

  # update the font size for subplot titles
  for i in fig['layout']['annotations']:
    i['font'] = dict(size=20)

  return fig 

analyse_one_var(test_data, 'Canada', 'confirmed_deaths', 1)

This function works as expected. The lines look strange, but this is expected because we generated the data randomly. Now, we will test a two-variable function. For this function, we expect two variables to be plotted as different lines. Each line should have a different style and colour. 

In [None]:
def analyse_two_vars(data, country, first_variable, second_variable, n, fill_between = False, colour_dict = var_colour_dict, rolling = 7):
  # convert country to lowercase
  country = country.lower().replace(" ", "_")
  
  # get the index values
  multi_index_vals =  list(data.index.get_level_values(level=0).unique())

  # check the time variable is in datetime format
  assert np.issubdtype(data.loc[:, 'time'], np.datetime64), '"time" column dtype should be datetime64[ns]'

  # check the country is in the multi-index value list
  assert country in multi_index_vals, "{country} not in the index.".format(country=country)

  # check the input variables are in the columns
  assert first_variable in data.columns, "First input variable not found in dataframe columns. "
  assert second_variable in data.columns, "Second input variable not found in dataframe columns. "

  # option to fill between the two lines (emphasize difference)
  if fill_between:
    fill = 'tonexty'
  else:
    fill = None

  fig = make_subplots(rows = 1, cols =1)

  # select data for x and ys
  x = data.loc[country]['time']
  y1 = data.loc[country][first_variable].rolling(rolling, min_periods=1).mean()
  y2 = data.loc[country][second_variable].rolling(rolling, min_periods=1).mean()

  # specify colours for each variable
  y1_colour = colour_dict[first_variable]
  y2_colour = colour_dict[second_variable]
  
  # add a line plot for the first variable
  fig.add_trace(go.Scatter(
      x=x,
      y=y1,
      name='Daily {first_variable}'.format(first_variable = first_variable.replace("_", " ")),
      line=dict(color=y1_colour, width=2)
  ), row = 1, col=1)

  # add a line plot for the second variable
  fig.add_trace(go.Scatter(
      x=x,
      y=y2,
      name='Daily {second_variable}'.format(second_variable = second_variable.replace("_", " ")),
      line=dict(color=y2_colour, width=3,
                dash = 'dot'), # add the dotted line for additional cue
      fill=fill
  ), row = 1, col=1)

  # update the figure layout
  fig.update_layout(
      width=1300,
      height=600,
      template = "plotly_white",
      showlegend=True, 
      titlefont = {'size': 28},
      title = "Figure {n}. COVID-19 daily {first_variable} and {second_variable} in {country}, rolling {rolling}-day average".format(first_variable = first_variable.replace("_", " "),
                                                                                          second_variable = second_variable.replace("_", " "),
                                                                                          country = country.replace("_", " ").title(), rolling = rolling, n=n)
      )
  
  # set y-axis title
  if 'rate' in first_variable and 'rate' in second_variable:
    fig.update_yaxes(title_text = "rate")
  elif 'rate' not in first_variable and 'rate' not in second_variable:
    fig.update_yaxes(title_text="Count", row=1, col=1)
  
  # set x-axis title
  fig.update_xaxes(title_text="Time", row=1, col=1)

  return fig

analyse_two_vars(test_data, "Canada", "confirmed_deaths", "confirmed_cases", 1)

This function also seems to be working as expected: it has plotted two lines, both different colours and shapes. The title indicates the name of the country, the figure number, and the rolling average. Again, both lines look odd but this is because we generated the data randomly. Now, we will move on to the next function which plots all countries on the same plot for a given variable. 
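The rolling average used in this function is positional: `rolling(7, min_periods=1)` averages the current row and up to six rows before it, skipping NaN values. A minimal sketch with a 3-row window on made-up numbers follows. Sorting by date first is our assumption about what a rolling average over time requires; the test dataset's dates are generated in random order, which may also contribute to the odd-looking lines.

```python
import pandas as pd

toy = pd.DataFrame({
    "time": pd.to_datetime(
        ["2020-03-03", "2020-03-01", "2020-03-05", "2020-03-02", "2020-03-04"]
    ),
    "confirmed_deaths": [None, 1.0, 7.0, 3.0, 5.0],
})

# put the rows in date order so that each window covers consecutive days
toy = toy.sort_values("time").reset_index(drop=True)

# average the current row and up to two previous rows, ignoring missing values
toy["rolling_average"] = toy["confirmed_deaths"].rolling(3, min_periods=1).mean()

print(toy)
```

With `min_periods=1`, the first rows are averaged over fewer values instead of being returned as NaN, so the line starts on the first date.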

In [None]:
def plot_all_countries(data, n, show_error=False, colour_dict = colour_blind_colours, rolling = 7):
  
  mortality_metric = 'case_fatality_rate'
  
  # reset the index and put countries back into a column so that plotly can group by country
  data = data.reset_index(level=0).rename(columns = {'level_0': 'country'})
  
  # create a column that contains the rolling average
  data['rolling_average'] = round(data.loc[:, mortality_metric].rolling(rolling, min_periods=1).mean(), 1)

  # calculate the deviation from the average for each observation
  data['error'] = data.loc[:, mortality_metric] - data['rolling_average']

  # show error if box is ticked 
  if show_error == True:
    error_selection = "error"
  else:
    error_selection = None

  # generate a figure
  fig = px.line(data,
                y= "rolling_average",
                x = "time",
                color = "country", 
                color_discrete_sequence = colour_dict, 
                error_y = error_selection,
                title="Figure {n}. COVID-19 {mortality_metric} per day for selected countries, rolling {rolling}-day average".format(rolling = rolling, mortality_metric = mortality_metric.replace('_', ' '), n=n).title(), 
                labels = dict(time = "Time",
                 rolling_average = "{mortality_metric} (%)".format(mortality_metric = mortality_metric.replace('_', ' ').title())),
                template = "plotly_white")
  
  # update the figure
  fig.update_layout(titlefont = {'size': 26}, 
                width = 1300, 
                height = 600)
  #fig.show()

  return fig 

plot_all_countries(test_data, 2)

This function also seems to be working as expected. The title shows the correct figure number, the variable being plotted, and states the 7-day rolling average. The legend contains all the countries within the dataset, and each country can be selected or de-selected. The case fatality rate is on the y-axis and time is on the x-axis.  
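One point to watch when several countries share a dataframe is that a plain `rolling` call runs across country boundaries, so the first rows of one country are averaged with the last rows of the previous one. A possible refinement, sketched here on made-up data and not part of the function above, is to compute the rolling average within each country using `groupby` and `transform`:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["a", "a", "a", "b", "b", "b"],
    "time": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "case_fatality_rate": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
toy = toy.sort_values(["country", "time"]).reset_index(drop=True)

# rolling average computed separately for each country (2-row window)
toy["rolling_average"] = (
    toy.groupby("country")["case_fatality_rate"]
       .transform(lambda s: s.rolling(2, min_periods=1).mean())
)

print(toy)
```

Here the first value for country `b` is 10.0 rather than an average that includes the last value of country `a`.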

## **Data Cleaning and Engineering**

**Author:** Julia

**Created:** October 29th, 2020

**Latest date edited:**  October 29th, 2020

Next, we import the data required for analysis. We originally explored data from a number of sources, but to allow direct comparisons we elected to conduct all of our analysis using a single dataset that contains all the information we require. This data was sourced from Our World in Data (https://ourworldindata.org/coronavirus).

In [None]:
# Created by Laine, 2020-10-30
# import csv file straight from Our World in Data's github repo (updates automatically)
raw_data = pd.read_csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')

Now we explore the shape of the dataset and the variables it contains.

In [None]:
# Created by Julia, 2020-10-29
print("Shape of dataset: ", raw_data.shape)
raw_data.head()

Shape of dataset:  (57394, 50)


Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,tests_per_case,positive_rate,tests_units,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
0,AFG,Asia,Afghanistan,2019-12-31,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
1,AFG,Asia,Afghanistan,2020-01-01,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,0.0,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
2,AFG,Asia,Afghanistan,2020-01-02,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,0.0,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
3,AFG,Asia,Afghanistan,2020-01-03,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,0.0,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
4,AFG,Asia,Afghanistan,2020-01-04,,0.0,,,0.0,,,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,0.0,38928341.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498


We can see that this dataset is quite large. It contains 57394 rows and 50 columns. We can also see a number of missing values, which we will leave as NaN. These missing values are one source of uncertainty in the data - we do not know if there is a systematic reason that they are missing, and we do not know whether having this data would affect our analysis. Keeping in mind that we do not have a complete dataset will remind us to be careful about drawing conclusions.
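The extent of missingness can be quantified per column before deciding how much weight to give each variable. A minimal sketch on a made-up dataframe (the column names follow the dataset, the values do not):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "new_cases": [1.0, np.nan, 3.0, 4.0],
    "new_deaths": [0.0, 0.0, np.nan, np.nan],
    "new_tests": [np.nan, np.nan, np.nan, np.nan],
})

# share of missing values in each column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

print(missing_share)
```

Applied to the real data, a listing like this shows which variables carry the most uncertainty from missing values.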

We will now filter the data down to the variables required for our analysis - determining how deadly COVID-19 is in 6 countries: Australia, Canada, India, Singapore, the United Kingdom, and the United States.

In [None]:
# Created by Julia, 2020-10-29
# First, create a datetime variable

# the 'date' column is in ISO format (YYYY-MM-DD), so we state the format explicitly
raw_data['time'] = pd.to_datetime(raw_data['date'], format = '%Y-%m-%d')

# function to create separate datasets for each country

def select_country(raw_data, iso_code, select_cols):
  """Returns the selected columns for the rows matching the given ISO code."""
  data = raw_data.loc[:, select_cols]
  data = data[data.loc[:, 'iso_code'] == iso_code ]

  return data

We then use the function to tidy the data, creating individual dataframes for each country. Each dataframe will contain corresponding variables.

In [None]:
# Created by Laine, 2020-10-30

def select_country(raw_data, iso_code, select_cols):
  data = raw_data.loc[:, select_cols]
  data = data[data.loc[:, 'iso_code'] == iso_code ]

  return data

In [None]:

# Created by Julia 2020-10-29
# Adapted by Laine 2020-10-30 -- added select_cols, converted country names to lower case for easier coding 
select_cols = ['time',
               'iso_code',
               'location', #left in so we can easily check the data
                'new_cases',
                'new_deaths',
                'new_tests',
                'population'
                ]

australia = select_country(raw_data, 'AUS', select_cols)
canada = select_country(raw_data, 'CAN', select_cols)
india = select_country(raw_data, 'IND', select_cols)
singapore = select_country(raw_data, 'SGP', select_cols)
united_kingdom = select_country(raw_data, 'GBR', select_cols)
united_states = select_country(raw_data, 'USA', select_cols)

We now have 6 individual tidy dataframes, one for each country of interest: Australia, Canada, India, Singapore, the United Kingdom (UK), and the United States (US). Each dataframe has 303 rows and 7 columns. The rows are observations, and the columns are those listed in `select_cols`: time, ISO code, location, new cases, new deaths, new tests, and population.
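
The filtering step can be sketched on a small invented dataset. This is a minimal illustration rather than the notebook's own data: `toy_raw`, `toy_cols` and every value in them are assumptions made up for the example, and the `.copy()` call is an addition that keeps later column assignments from touching the original frame.

```python
import pandas as pd

# Toy stand-in for raw_data; all values are invented for illustration only.
toy_raw = pd.DataFrame({
    "iso_code": ["AUS", "AUS", "CAN", "SGP"],
    "location": ["Australia", "Australia", "Canada", "Singapore"],
    "date": ["2020-03-01", "2020-03-02", "2020-03-01", "2020-03-01"],
    "new_cases": [2.0, 4.0, 5.0, 4.0],
    "new_deaths": [0.0, 1.0, 0.0, None],
    "continent": ["Oceania", "Oceania", "North America", "Asia"],
})
toy_raw["time"] = pd.to_datetime(toy_raw["date"])


def select_country(raw_data, iso_code, select_cols):
    """Keep only select_cols, then only the rows for one ISO code."""
    data = raw_data.loc[:, select_cols]
    # .copy() is an addition in this sketch, not part of the notebook's function
    return data[data["iso_code"] == iso_code].copy()


toy_cols = ["time", "iso_code", "location", "new_cases", "new_deaths"]
toy_australia = select_country(toy_raw, "AUS", toy_cols)
print(toy_australia.shape)  # (2, 5): two Australian rows, five selected columns
```

Columns that are not listed (here `continent` and `date`) are dropped, and missing values in the selected columns are carried through unchanged.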

## **Data Transformations**

**Author:** Julia

**Created:** October 29th, 2020

**Date last edited:** November 15th, 2020

Three variables were created:
- case_fatality_rate = $\frac{confirmed\ deaths}{confirmed\ cases} \times 100$
- mortality_rate = $\frac{confirmed\ deaths}{total\ population}\times 100\ 000$
- test_rate = $\frac{tests}{total\ population}\times 100\ 000$


We created a function to calculate the case fatality rate, mortality rate per 100 000 people and tests per 100 000 people per day for each country, and then insert these 3 new variables into each dataframe.

These functions were then tested and combined into one function to streamline the process of creating and adding CFR, MR and test rate into each dataframe.
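
As a worked example of the rates defined in this section, the sketch below applies the same arithmetic to a few invented rows (`toy` and all of its numbers are assumptions for illustration). It also shows something the formulas imply: on days with zero confirmed cases the case fatality rate is undefined, and pandas reports it as `NaN` (0/0) or `inf` (deaths/0). Replacing `inf` with `NaN` is shown as one possible treatment, not necessarily the one used in this notebook.

```python
import numpy as np
import pandas as pd

# Invented daily figures for a population of one million people
toy = pd.DataFrame({
    "new_cases":  [200.0, 50.0, 0.0, 0.0],
    "new_deaths": [4.0, 1.0, 0.0, 2.0],
    "new_tests":  [5000.0, 2500.0, np.nan, 1000.0],
    "population": [1_000_000.0] * 4,
})

# Percentage of confirmed cases that ended in death
toy["case_fatality_rate"] = (toy["new_deaths"] / toy["new_cases"] * 100).round(1)
# Deaths and tests per 100 000 people
toy["mortality_rate"] = (toy["new_deaths"] / toy["population"] * 100_000).round(1)
toy["test_rate"] = (toy["new_tests"] / toy["population"] * 100_000).round(1)

# Assumption: treat days with deaths but no confirmed cases as missing
cfr_finite = toy["case_fatality_rate"].replace([np.inf, -np.inf], np.nan)

print(toy[["case_fatality_rate", "mortality_rate", "test_rate"]])
print(cfr_finite)
```

The first row gives a case fatality rate of 2.0%, a mortality rate of 0.4 per 100 000 and a test rate of 500.0 per 100 000.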

In [None]:
# created by Julia, 2020-10-29
# edited by Laine, 2020-10-30: new vars weren't appearing in data after this function was applied, so i changed it a bit (your original code is commented out)
# function to calculate CFR, MR and test rate and insert into each dataframe

def insert_variables(data):
  """function to calculate case fatality rate (CFR), mortality rate (MR) and test rate and insert into each dataframe. """

  CFR = round((data['new_deaths']/data['new_cases'])*100,1) # rounded to 1 decimal place

  MR = round((data['new_deaths']/data['population'])*100000,1) # rounded to 1 decimal place

  test_rate = round((data['new_tests']/data['population'])*100000,1) # rounded to 1 decimal place

  data['case_fatality_rate'] = CFR
  data['mortality_rate'] = MR
  data['test_rate'] = test_rate

  return data

We used this function to create the 3 new variables for each dataframe.

In [None]:
# created on 2020-10-30 by Laine
# purpose: to reduce code duplication in the cell below
list_of_countries = [australia, canada, india, singapore, united_kingdom, united_states]

for c in list_of_countries:
  c = insert_variables(c)

all_countries = pd.concat(list_of_countries, keys = ['australia', 'canada', 'india', 'singapore', 'united_kingdom', 'united_states'])
all_countries.rename(columns = {'new_deaths': 'confirmed_deaths', 'new_tests': 'tests', 'new_cases': 'confirmed_cases'}, inplace=True)


# **Analysis**

## **How is COVID-19 deadliness measured?**

**Authors:** Laine, Demitri

**Created:** November 12th, 2020

**Date last edited:** November 17th, 2020

### **Metrics to measure mortality**

There are two measures available to describe COVID-19 mortality: mortality rate and case fatality rate. Mortality rate is the number of confirmed COVID-19 deaths per one hundred thousand people. Case fatality rate is the percentage of confirmed COVID-19 cases that ended in death. The key difference between the two metrics is that case fatality rate is calculated from confirmed cases, which depend on the number of tests conducted, whereas mortality rate is calculated from the total population. Case fatality rate therefore takes into account the degree to which each country has attempted to detect the disease. For this reason, we measure COVID-19 deadliness according to the case fatality rate. 

### **Showing uncertainty**
We have decided to be transparent about the missing data by leaving a 'gap' in the line chart when there is no data for that day. 

In the table, the first row indicates the number of days with data available for that month. The more days, the more reliable the number. If there is a month with one or zero days of data, we were not able to calculate summary statistics for that month. When this is the case, we have filled the table cell with the value 'not enough data'.
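
The rule described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the notebook's own plotting helper: `monthly_summary`, `min_days` and the invented series `toy_daily` are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd

# Invented daily series: March is complete, April has a single day
# of data and May has none.
days = pd.date_range("2020-03-01", "2020-05-31", freq="D")
toy_daily = pd.Series(np.arange(len(days), dtype=float), index=days)
toy_daily.loc["2020-04-02":"2020-05-31"] = np.nan


def monthly_summary(series, min_days=2):
    """Summarise a daily series by month, flagging months with too little data."""
    columns = {}
    for month, chunk in series.groupby(series.index.to_period("M")):
        n_days = int(chunk.count())  # count() ignores missing values
        if n_days < min_days:
            mean = std = "not enough data"
        else:
            mean, std = round(chunk.mean(), 1), round(chunk.std(), 1)
        columns[str(month)] = {"days_with_data": n_days, "mean": mean, "std": std}
    # One column per month; the first row is the number of days with data
    return pd.DataFrame(columns)


toy_summary = monthly_summary(toy_daily)
print(toy_summary)
```

With these invented values, March reports 31 days with data and numeric statistics, while April (one day) and May (zero days) are filled with 'not enough data'.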

### **How to use the cell below**

In the cell below, please select the country you would like to analyse. After selecting the country, press play on the interactive graph to see the results.  

In [None]:
#Created by Laine, 2020-11-12

#@title Please select a country: { run: "auto" }

Country =  'United Kingdom' #@param  ['Australia', 'Canada', 'India', 'Singapore', 'United Kingdom', 'United States']

country = Country.lower().replace(' ', '_')
mortality_metric = 'case_fatality_rate'


case_fatality_rate_plot = analyse_one_var(all_countries, country, mortality_metric, 1)

#case_fatality_rate_plot.write_html("case_fatality_rate.html")

case_fatality_rate_plot.show()
#files.download("case_fatality_rate.html")

**Authors:** Chloe

**Created:** November 14th, 2020

**Date last edited:** November 14th, 2020

### **Summary**

<table>
<tr>
<th>Country</th>
<th>Australia</th>
<th>Canada</th>
<th>India</th>
<th>Singapore</th>
<th>United Kingdom (UK)</th>
<th>United States (US)</th>
</tr>
<tr>

<td>Daily Mortality Rate</td>

<td>
There are two obvious spikes in the Mortality<br>
rate, the first, smaller spike occurs between<br>
May and late May 2020 while the much larger <br>
second spike occurs in September 2020.
</td>

<td>
There is one large peak around March, and a <br>
second smaller peak building in November. It <br>
also appears that there are no deaths during <br>
August or September due to the number of <br>
significant figures.
</td>

<td>
Mortality rate is always between 0 and 0.1 with <br>
several spikes. Such a low mortality rate might <br>
be due to India's large population.
</td>

<td>
Because only 28 COVID-19 deaths were recorded in <br>
Singapore, the mortality rate over time has <br>
stayed at zero, despite the large number <br>
of cases the country saw in April-May.
</td>

<td>
There is one large peak around March, and a <br>
second smaller peak building in November. It also <br>
appears that there are no deaths during September <br>
due to the number of significant figures.
</td>

<td>
The mortality rate peaked in mid-April and has <br>
been declining and fluctuating since. Recently, <br>
the mortality rate has been around 0.3, which means <br>
about 0.3 deaths per 100 000 people <br>
per day.
</td>

</tr>
<tr>
<td>Daily Case Fatality Rate</td>

<td>
The rate peaked in early March and was very low <br>
for the rest of the period, apart from two <br>
waves: one from early April to late May and one <br>
from early August to late October.
</td>

<td>
There was a big wave of the rate from early March <br>
until late September. Now, the rate is very low but might<br>
increase slightly.
</td>

<td>
Case fatality rate is overall quite low but two significant <br>
spikes are noteworthy.
</td>

<td> Overall, the rate is extremely low, mostly at or <br>
near 0. However, there was a sudden peak on 13 October, <br>
when the rate reached 25%. </td>

<td> There was a big wave in the rate from early March until late<br>
July, ending with a sudden peak that was the highest of the wave. Since early <br>
August, the rate has been quite low, at approximately 1%.   </td>

<td> There was a peak in early March, and the <br>
case fatality rate has declined steadily since, <br>
to less than 1% today. </td>
</tr>

</table>

Although the case fatality rate appears quite high, recent evidence suggests this is because COVID-19 cases are under-reported (Rajgor et al., 2020). This might be due to asymptomatic individuals who are not aware they are infected, or to testing regimes that do not allow testing of individuals without fever-like symptoms. Because these undetected infections are missing from the denominator, the case fatality rate overestimates COVID-19 deadliness. 
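
The effect of under-reporting on the case fatality rate can be illustrated with simple arithmetic. The numbers below are hypothetical and chosen only to make the calculation easy to follow; the sketch also assumes that every death is counted, which may not hold in practice.

```python
def case_fatality_rate(deaths, confirmed_cases):
    """Deaths as a percentage of confirmed cases."""
    return deaths / confirmed_cases * 100


# Hypothetical outbreak: every number here is invented for illustration.
total_infections = 2000
deaths = 20  # assumed to be fully counted

for detected_share in (1.0, 0.5, 0.25):
    confirmed_cases = total_infections * detected_share
    cfr = case_fatality_rate(deaths, confirmed_cases)
    print(f"{detected_share:.0%} of infections confirmed -> CFR {cfr:.1f}%")
```

With the same number of deaths, halving the share of infections that are confirmed doubles the case fatality rate: 1.0% when all infections are detected, 2.0% at half, and 4.0% at a quarter.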

#### **Relation to the driving question**
* COVID-19 deadliness, as measured by the case fatality rate, appears to change over time. 
* Missing values might be a source of uncertainty.
* The case fatality rate over-estimates COVID-19 deadliness because many cases go undetected, which contributes to the uncertainty associated with understanding COVID-19 deadliness 




## **How does COVID-19 deadliness change between countries?**

**Authors:** Jamie, Laine

**Created:** November 12th, 2020

**Date last edited:** November 17th, 2020


### Why compare across countries

To obtain a better understanding of COVID-19 deadliness, we compare how COVID-19 deadliness changes across different countries to determine if its deadliness changes with country-specific factors. 

### A note about the colours 
The colours for this plot were chosen to ensure that **colourblind users** could clearly see the difference between each line. 

### Showing uncertainty
We have decided to be transparent about the missing data by leaving a 'gap' in the line chart when there is no data for that day. 

Because this chart shows the seven day rolling average, which can obscure some of the finer details, we have also included two additional tables: the first table shows the summary statistics for case fatality rate for each country, and the second table standardises these values using a Z-score for easy comparison. 
In the Z-score table, the minimum for each country has been removed because every country had the same minimum value, so there is nothing to compare. 

### How to use the cell below

Press play on the interactive graph to see the results.  You can reduce the number of countries shown by clicking the countries you no longer want to see on the legend (located on the right side of the chart). If you want to see one country only, double-click the country name on the legend. 
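
The Z-score standardisation behind the second table can be sketched on invented summary statistics. In this minimal example, `toy_stats` and `zscore_columns` are hypothetical names, and the calculation uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`. It also shows why a column where every country has the same value, such as the minimum, cannot be standardised.

```python
import pandas as pd

# Invented summary statistics for three hypothetical countries
toy_stats = pd.DataFrame(
    {"mean": [2.0, 4.0, 6.0], "max": [10.0, 20.0, 60.0], "min": [0.0, 0.0, 0.0]},
    index=["country_a", "country_b", "country_c"],
)


def zscore_columns(frame):
    """Standardise each column: subtract its mean, divide by its standard deviation."""
    return (frame - frame.mean()) / frame.std(ddof=0)


toy_z = zscore_columns(toy_stats)
print(toy_z.round(2))
```

Each standardised column is centred on zero, so a positive value marks a country above the average of the group. The `min` column comes out as `NaN` because its standard deviation is zero.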



In [None]:
#Created by Laine, 2020-11-12
#@title Please run this cell to show the interactive graph. { run: "auto" }

mortality_metric = "case_fatality_rate"
country = Country.lower().replace(' ', '_')


# generate the plot
all_countries_plot = plot_all_countries(all_countries, 2)

all_countries_plot.show()

# We also want to show descriptive stats and zscores in a table

# generate summary statistics
case_fatality_sum_stats = round(all_countries.loc[:, ['location', 'case_fatality_rate']].groupby('location').describe(), 0)

# convert count variable to integer (assign through the column tuple rather than chained indexing)
case_fatality_sum_stats[('case_fatality_rate', 'count')] = case_fatality_sum_stats[('case_fatality_rate', 'count')].astype('int')

# get the column names for the headers
hdr_vals_sum_stats = ['country'] + list(case_fatality_sum_stats['case_fatality_rate'].columns.values)

# country names for the first column
col1_vals_sum_stats = list(case_fatality_sum_stats.index.values)

# all row values 
cell_vals_sum_stats = [col1_vals_sum_stats, 
          case_fatality_sum_stats['case_fatality_rate']['count'],
          case_fatality_sum_stats['case_fatality_rate']['mean'],
          case_fatality_sum_stats['case_fatality_rate']['std'], 
          case_fatality_sum_stats['case_fatality_rate']['min'],
          case_fatality_sum_stats['case_fatality_rate']['25%'],
          case_fatality_sum_stats['case_fatality_rate']['50%'],
          case_fatality_sum_stats['case_fatality_rate']['75%'],
          case_fatality_sum_stats['case_fatality_rate']['max']] 

# create the table title 
sum_stats_title = "Figure 2.1. COVID-19 {mortality_metric} Summary statistics".format(mortality_metric = mortality_metric.title())

# create table with sum stats for each country
all_countries_sum_stats_fig, all_countries_sum_stats_tbl = create_table(hdr_vals_sum_stats, cell_vals_sum_stats, sum_stats_title)

# generate zscores
case_fatality_zscores = round(pd.DataFrame(sp.zscore(case_fatality_sum_stats['case_fatality_rate'].drop(columns=['min']), axis=0),
                   index=col1_vals_sum_stats, 
                   columns = list(case_fatality_sum_stats['case_fatality_rate'].drop(columns = ['min']).columns.values))
,0)


# generate header vals for zscore tbl
hdr_vals_zscores = ['country'] + list(case_fatality_zscores.columns.values)

# generate cell vals for zscores tbl
cell_vals_zscores = [col1_vals_sum_stats] + [
                        case_fatality_zscores['count'],                      
                       case_fatality_zscores['mean'], 
                       case_fatality_zscores['std'],  
                       case_fatality_zscores['25%'], 
                       case_fatality_zscores['50%'], 
                       case_fatality_zscores['75%'],
                       case_fatality_zscores['max']]


zscore_title = "Figure 2.2. COVID-19 {mortality_metric} Z scores".format(mortality_metric = mortality_metric.title())

# create table with zscore for each country
all_countries_zscores_fig, all_countries_zscores_tbl = create_table(hdr_vals_zscores, cell_vals_zscores, zscore_title, ncols=8)

# save the tables 
#all_countries_sum_stats_tbl.write_image('all_countries_sum_stats_tbl.png')
#all_countries_zscores_tbl.write_image('all_countries_zscores_tbl.png')
#all_countries_plot.write_image('all_countries_plot_static.png')
#all_countries_plot.write_html("all_countries_plot.html")

# download the saved files 
#files.download('all_countries_sum_stats_tbl.png')
#files.download('all_countries_zscores_tbl.png')
#files.download('all_countries_plot_static.png')
#files.download("all_countries_plot.html")

subplot_titles = [sum_stats_title.title(), zscore_title.title()]
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{"type": "table", "colspan": 2}, None],
           [{ "type": "table", "colspan": 2}, None]],
           subplot_titles = subplot_titles,
    vertical_spacing = 0.15, 
    horizontal_spacing = 0.15)

fig.add_trace(all_countries_sum_stats_tbl, row =1, col = 1)
fig.add_trace(all_countries_zscores_tbl, row = 2, col = 1)

fig.update_layout(title_font = {'size': 28}, width = 1300,
                height = 600)

# update the font size for subplot titles
for i in fig['layout']['annotations']:
  i['font'] = dict(size=20)

fig.show()

**Author:** Laine

**Created:** November 18th, 2020

**Date last edited:** November 18th, 2020


In the figures above, it appears that COVID-19 deadliness changes between countries. Looking at the graph of all countries, COVID-19 has been most deadly in Australia and least deadly in Singapore. Interestingly, the graph of the seven-day rolling average shows a different maximum value from the one in the summary statistics. For instance, Australia appears to have a maximum case fatality rate of about 35% in Figure 2, in contrast to the maximum value of 100% reported in Figure 2.1. This is likely because the seven-day rolling average smooths over single-day outliers. In this case, the day showing a 100% case fatality rate is correct: the daily rate reaches 100% whenever the number of new deaths reported on a day equals the number of new cases reported that day. 
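
The smoothing effect can be reproduced on an invented series. The sketch below assumes a trailing seven-day window computed with pandas `rolling`; the notebook's plotting helper may compute its average differently, and `daily_cfr` is made-up data.

```python
import pandas as pd

# Invented daily case fatality rate: 2% every day, except one day at 100%
days = pd.date_range("2020-08-01", periods=14, freq="D")
daily_cfr = pd.Series(2.0, index=days)
daily_cfr.iloc[7] = 100.0  # e.g. one death and one confirmed case that day

# Trailing seven-day mean; the first six days have no full window and are NaN
rolling_cfr = daily_cfr.rolling(window=7).mean()

print(daily_cfr.max())              # maximum of the raw daily values
print(round(rolling_cfr.max(), 1))  # maximum after smoothing
```

The raw maximum is 100%, but any window containing the spike averages it with six days at 2%, giving (6 × 2 + 100) / 7 = 16%. Summary statistics computed on the daily values and a chart of the rolling average can therefore show very different maxima for the same data.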

#### **Relation to the driving question:**


*   COVID-19 deadliness appears to change between countries
*   The seven-day rolling average may be a source of uncertainty







## How do COVID-19 deaths, cases, and tests change over time?

**Authors:** Julia, Chloe, Laine

**Created:** November 12th, 2020

**Date last edited:** November 17th, 2020


We investigated each variable individually in order to understand the distributions of each variable over time. We first looked at deaths to see how the numbers of deaths change over time in each country. We then looked at cases to see the number of cases over time in each country. Finally, we looked at tests to see the number of tests in each country over time. It is important to first investigate variables individually, before comparisons can be made between variables, because it allows us to better understand the relationship between each variable. It also enables us to understand what the source of uncertainty may be, for example if countries have different approaches to testing, which may affect the case fatality rate for the country and the perceived deadliness of COVID-19.

We selected Australia, Canada, India, Singapore, the United Kingdom and the United States as the subjects of our investigation. The first reason we chose them is that their English-language resources are easily accessible. Moreover, the United States and India are the two countries with the most cases so far, and they represent developed and underdeveloped countries with large populations, respectively. The United Kingdom is one of the biggest hotspots in Europe. Canada, Australia, and Singapore have done a relatively good job of controlling the spread of coronavirus. These countries also all have different populations and different economic conditions.

### **Showing missing data**

We have decided to be transparent about the missing data by leaving a 'gap' in the line chart when there is no data for that day.

In the table, the first row indicates the number of days with data available for that month. The more days, the more reliable the number. If there is a month with zero days of data, we were not able to calculate summary statistics for that month. When this is the case, we have filled the table cell with the value 'not enough data'.

### **How to use the cell below**

In the cell below, please select the country and variable you would like to analyse. After selecting these variables, please run the cell to see the result.
 



In [None]:
#@title Please select a country and a variable. { run: "auto" }

# Author: Laine
# Created: November 11th, 2020
# Last edited: November 18th, 2020

Country =  'United States' #@param  ['Australia', 'Canada', 'India', 'Singapore', 'United Kingdom', 'United States']

Variable = 'Confirmed Cases' #@param ['Confirmed Cases', 'Confirmed Deaths', 'Tests']

country = Country.lower().replace(' ', '_')
variable = Variable.lower().replace(' ', '_')

analyse_one_var(all_countries, country, variable, 3)

**Author:** Chloe

**Created:** November 14th, 2020

**Date last edited:** November 14th, 2020

### **Summary**

<table>
<tr>
<th>Country</th>
<th>Australia</th>
<th>Canada</th>
<th>India</th>
<th>Singapore</th>
<th>United Kingdom (UK)</th>
<th>United States (US)</th>
</tr>
<tr>

<td>Daily Cases over Time</td>
<td>There are two large peaks around March to <br>
May 2020 and July to September 2020.</td>

<td>
A small peak of cases around May, followed by <br>
a much larger peak of cases in November.
</td>

<td>
A dramatic increase since May, peaking in <br>
mid-September.
</td>

<td>In around April to May, there was a spike in the <br>
number of daily cases, with April 21st having <br>
the peak number of cases per day at 1426 cases.
</td>

<td>
A very sharp increase in cases in October, which <br>
begins to ease in November.
</td>

<td>
November 2020 has the highest number of confirmed <br>
cases on record, with a median of approximately <br>
130,000 cases. Moreover, the number of cases fluctuates <br>
but has kept increasing since late March 2020.
</td>
</tr>
<tr>
<td>Daily Deaths over Time</td>
<td>There are two obvious spikes in the number of<br>
deaths, the first, smaller spike occurs between <br>
March and May 2020 while the much larger <br>
second spike occurs between mid-July and <br>
October 2020.
</td>

<td>The number of deaths peaked around May, and <br>
is being followed by a smaller peak in October.
</td>

<td>Gradually increased until mid-September and then <br>
dropped by around 50% by 7th November.<br>
The overall trend is smooth; however, there are<br>
two spikes that are noteworthy.
</td>
<td>Overall, the deaths due to COVID-19 in <br>
Singapore are extremely low, with the number<br>
being at 28 deaths total. </td>
<td> The number of deaths peaked in May, and <br>
is being followed by a smaller peak in November.
</td>
<td>The number of deaths has been fluctuating and <br>
decreasing since mid-April 2020, which had the highest <br>
number of deaths so far.</td>
</tr>
<tr>
<td>Daily Tests over Time</td>
<td>There is some missingness of test data. <br>
There seems to be a sharp increase in the<br> 
number of tests from April 2020 to August 2020, <br>
after August 2020 the number of tests conducted <br>
continues to decrease before starting to increase<br>
 around October 2020. </td>
<td>Steadily increased over time, until it drops<br>
in October and decreases from here. </td>
<td>Recorded from 19 March. Showed an overall climbing <br>
trend with some fluctuations.</td>
<td>Overall, Singapore had<br>
good control of COVID-19.</td>
<td> The number of COVID-19 tests conducted in <br>
the United Kingdom has steadily increased over time.
</td>
<td>The number of tests increased, with fluctuations, <br>
until late October 2020. Since early <br>
November 2020, the number of tests has been declining <br>
steadily, possibly related to the 2020 US Presidential Election.</td>
</tr>
</table>


With regard to COVID-19 across the countries we analysed, the number of deaths is very low in comparison to the number of cases observed. However, countries such as India have had a high number of COVID-19 deaths compared with the other countries observed, such as Singapore, Australia and Canada, where the number of deaths relative to the number of cases has been extremely low. This could be related to differences in the medical care provided in these countries, as well as the share of older people in each population.


#### **Relation to the driving question** 


* It is important to explore data visualisations as well as looking at the data itself to determine patterns and sources of uncertainty in the data as they may influence our certainty around concluding the deadliness of COVID-19.

## **What is the relationship between COVID-19 cases, deaths, and tests?**

**Authors:** Chloe, Laine

**Created:** November 12th, 2020

**Date last edited:** November 17th, 2020

Testing plays an important role in understanding the deadliness of COVID-19.  For instance, if the number of tests conducted per day is low, we have reason to suspect that COVID-19 cases and deaths have been underreported.

As mentioned earlier, the number of confirmed cases and the number of deaths are the two inputs used to calculate the case fatality rate. If either or both of them contain uncertainty, our case fatality rate will be uncertain as well. Meanwhile, the test rate influences the number of confirmed cases and confirmed deaths. We therefore have to look into them separately in order to fully understand the uncertainty and accuracy of the case fatality rate. 

### Visualising long-term trends 
It has been estimated that death from COVID-19 occurs, on average, 21 days after infection. To reflect this delay, we show the data as a **21-day rolling average** instead of a 7-day rolling average. This allows us to see the relationship between deaths and other variables more easily. 


### Using an appropriate scale 
Daily deaths are quite small relative to daily cases or tests. When we plot them on a graph together, we want to see how one variable varies with another. To compare two variables on two very different scales, we use a **logarithmic scale** on the y-axis.  
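
A short numerical sketch of why the logarithmic scale helps: on a linear axis, a doubling of cases dwarfs a doubling of deaths, whereas on a log axis both doublings cover the same vertical distance. The counts below are invented, and `steps` and `log10_steps` are hypothetical helper names.

```python
import math

# Invented counts: both series double at every step
toy_deaths = [10, 20, 40]
toy_cases = [1000, 2000, 4000]


def steps(values):
    """Differences between consecutive values (distance on a linear axis)."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def log10_steps(values):
    """Differences between consecutive log10 values (distance on a log axis)."""
    return steps([math.log10(v) for v in values])


print(steps(toy_deaths), steps(toy_cases))  # linear: very different sizes
print(log10_steps(toy_deaths))              # log: each doubling is about 0.301
print(log10_steps(toy_cases))               # log: the same distance for cases
```

One caveat is that zero has no logarithm, so days with a count of zero cannot be drawn on a log axis.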

### Showing missing data
We have decided to be transparent about the missing data by leaving a 'gap' in the line chart when there is no data for that day. 

### A note about the design 

The variables plotted have two different line styles (one continuous, the other dashed). These different line styles have been implemented to ensure this chart is **colour-blind friendly**. 

We chose to fill the space between the first variable and second variable to emphasise how the difference between these two variables changes over time. Emphasising the difference in this way makes use of the 'enclosure' **Gestalt principle**. 

### How to use the cell below
In the cell below, please select the country and variables you would like to analyse. After selecting these variables, press play on the interactive graph to see the results.  

In [None]:
#@title Please select a country and two variables. { run: "auto" }

# created by Laine, 2020-11-09

Country =  'Australia' #@param  ['Australia', 'Canada', 'India', 'Singapore', 'United Kingdom', 'United States']

First_Variable = 'Confirmed Deaths' #@param ['Confirmed Cases', 'Confirmed Deaths', 'Tests']

Second_Variable = 'Confirmed Cases' #@param ['Confirmed Cases', 'Confirmed Deaths', 'Tests']

country = Country.lower().replace(' ', '_')
variable1 = First_Variable.lower().replace(' ', '_')
variable2 = Second_Variable.lower().replace(' ', '_')



cases_vs_deaths_fig = analyse_two_vars(all_countries, country, variable1, variable2, 4, fill_between=True, rolling = 21)

# y is log-scale:this is done to better see the relationship between cases and deaths 
cases_vs_deaths_fig.update_yaxes(type="log", title_text = "Count (Log Scale)")

cases_vs_deaths_fig.update_layout(title_font = {'size': 24})

Contributed to by Chloe, 2020-11-14

### **Summary**

<table>
<tr>
<th>Country</th>
<th>Australia</th>
<th>Canada</th>
<th>India</th>
<th>Singapore</th>
<th>United Kingdom (UK)</th>
<th>United States (US)</th>
</tr>
<tr>
<td>Log Tests vs. Log Deaths</td>
<td> We do not have testing results for <br>
the months of January to April. The number<br>
of cases spikes around mid-April and then <br>
starts to decrease until mid-May, when the<br>
number of cases begins to increase before <br>
peaking in mid-July.</td>
<td>
The rate at which testing increases<br>
appears steeper early in 2020, <br>
and levels out to a weak positive incline.<br>
The rate at which deaths due to COVID-19 <br>
grow appears to steadily increase over time.
</td>

<td>
The growth rate of tests is 1000 <br>
times greater than the growth rate of deaths.
</td>

<td>
A high number of tests is being done, and <br>
deaths due to the virus have been low. <br>
The graph suggests that testing helps to <br>
control the spread of COVID-19 within<br>
the population.
</td>

<td>
The rate at which testing increases <br>
appears consistent. <br>
The rate at which deaths due to COVID-19<br>
grow appears to drop around August, <br>
and then continue to increase again from September.
</td>

<td>
Daily tests and daily deaths increase at a <br>
similar rate. However, there is a large difference <br>
between daily deaths and daily tests.
</td>

</tr>
<tr>
<td>Log Tests vs. Log Cases</td>
<td>There is some missingness in the test data. <br>
The number of cases spikes around mid-April and<br>
then starts to decrease until mid-May where the<br>
number of cases begins to increase before peaking<br>
in mid-July.</td>

<td>
There appears to be a drop in the rate of <br>
increase of cases around October; however, <br>
tests continue to grow at this time. </td>
<td>The daily case growth rate is consistently <br>
lower than the daily test growth rate, with a similar trend.
</td>

<td>
Despite having a high number of confirmed<br>
COVID-19 cases, much testing has also been done, <br>
at times 5 times more than the number of <br>
confirmed cases.
</td>

<td>
A mild fluctuation in the rate of increase of <br>
cases around August, however testing continues at<br>
a consistent rate. There is early<br>
missing test data, however the data that is present<br>
appears similar to the cases data, with consistently<br>
more tests being conducted than cases confirmed.
</td>

<td>
The difference between the daily number of tests <br>
and the daily number of cases is very large, which means<br> 
that far more of each day's tests come back negative<br> 
than positive.
</td>

</tr>
<tr>
<td>Log Deaths vs. Log Cases</td>
<td>The two lines on the plot behave in a similar<br>
fashion to one another, as the cases go up, more<br>
deaths are reported. </td>
<td>The daily cases and daily deaths data both <br>
appear to increase at the same rate over time.
</td>
<td>The growth rate of new cases first <br>
increased dramatically and now tends to be stable. <br>
Overall, the growth rate of daily <br>
cases is higher than that of daily deaths.</td>
<td>The number of cases in Singapore, although <br>
high, does not correlate with the deaths at all. This <br>
graph overall shows the low number of deaths.</td>
<td>Deaths and cases appear to be increasing<br>
at a similar rate in the United Kingdom<br>
for the most part.</td>

<td>
Both daily cases and daily deaths increase at a <br>
similar rate. However, there is a large difference <br>
between daily cases and daily deaths, which means <br>
that many positive cases are reported but few <br>
deaths.
</td>
</tr>
</table>

#### **Relation to the driving question**

There appears to be a strong relationship between cases and deaths over time. It therefore may be possible to loosely interpret the deadliness of COVID-19 based on case data.



# Discussion

## Considering bias 

*Explain how your presentation takes account of potential biases of data analysts and of target external audience, particularly confirmation bias, anchoring bias.* 


We have taken into account confirmation bias and anchoring bias in our presentation through numerous techniques. To deal with anchoring bias in our presentation and our work we discussed as a group how we could logically present the data. We decided that plotting variables over time would be a useful way for our audience to understand the information. Showing how the variables change over time is a good way to combat anchoring bias as it stops the audience from thinking about one particular point in time when for example COVID-19 was at its worst, or there was a drop in cases, and it allows them to see the full picture without always falling back to a particular piece of information. To combat confirmation bias, our group thought about the data in a flexible way. We constantly questioned what the data meant and what it represented, rather than simply accepting it as fact. We explored multiple datasets, read media articles, and scientific articles to get more perspectives on the data and the context of uncertainty. This meant that we were not seeking to confirm a hypothesis, rather we explored the data openly and aimed to display it with minimal transformations. As a group, we discussed when a graph appeared unclear or hard to interpret and agreed on improvements to make the information easier to understand. We challenged our personal beliefs and made decisions that maintained the integrity of the data. We also used approaches similar to the six thinking hats to get different perspectives on the data and thus create different plots and think about how they might be interpreted by different audiences.

## Miller's law

*Explain how the design of information presentation takes account of Miller’s Law (and its more conservative versions that recommend 5 +/– 2).*

Our presentation takes into account Miller’s law through the number of plots we display in the presentation and the number of points we talk about per slide. Our process Notebooks contained numerous plots that exceeded the recommendation of 5; as such, we reduced the number of plots in our final presentation and our product notebook. We also reduced the number of speaking points per slide to ensure that we were under Miller’s recommendation of 5 +/- 2. We kept colours consistent across variables in some plots and across countries in others. We used one colour for each variable and one colour for each country so the information remained consistent and did not confuse the audience. We also used as few colours as possible, but we needed at least one for each of the 6 countries.

## Visual perception

*Explain how you took account of theory about visual perception, especially colour, colourblindness, care in use of blinking.*

We used colours in our presentation and Product notebook that have positive connotations and are not associated with negative connotations in other countries, since people from other countries may be looking at our work. We also carefully considered colour blindness in both our presentation and Product notebook and only used colours that remain distinguishable for people with colour blindness. We first produced every plot/visualisation in monochrome, then discussed as a group the most appropriate colours for each one. In addition, line plots that compare multiple variables either plot one variable with a solid line and the other with a dashed line, or let the viewer click on lines to remove them from the graph and see a label when they hover over a line (for example, in the country graph). This makes it easier for visually impaired people to see and interpret our results. Our line plots are also interactive, which allows the user to isolate each variable for ease of viewing. Data across the 6 countries is therefore shown in a consistent manner, which also reflects the careful consideration our group gave to implementing Gestalt theory in our Product notebook.
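
One way to keep line colours and dash patterns consistent across plots is to define them once and reuse them. The sketch below is illustrative only: the Okabe-Ito palette and the `build_line_styles` helper are assumptions introduced for this example, not the colours or code used in this Notebook.

```python
# Six of the eight Okabe-Ito colours, a palette designed to stay distinguishable
# under common forms of colour blindness. Assumption: this is NOT necessarily
# the palette used in this Notebook.
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#0072B2", "#D55E00", "#CC79A7"]

# Dash names accepted by Plotly's line.dash property.
DASH_STYLES = ["solid", "dash", "dot", "dashdot", "longdash", "longdashdot"]

COUNTRIES = [
    "Australia",
    "Canada",
    "India",
    "Singapore",
    "United Kingdom",
    "United States",
]


def build_line_styles(names):
    """Give each name a unique colour and dash pattern (hypothetical helper)."""
    if len(names) > min(len(OKABE_ITO), len(DASH_STYLES)):
        raise ValueError("not enough distinct colours or dash patterns")
    return {
        name: {"color": colour, "dash": dash}
        for name, colour, dash in zip(names, OKABE_ITO, DASH_STYLES)
    }


styles = build_line_styles(COUNTRIES)
for country, style in styles.items():
    print(f"{country}: {style['color']} ({style['dash']})")
```

Each style dictionary can be passed as the `line=` argument of `go.Scatter`, so a country keeps the same colour and dash pattern in every plot. Pairing colour with a dash pattern means the lines stay distinguishable even when colour alone is not enough.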

## Gestalt theory

*Explain how you took account of Gestalt theory.*

We took account of Gestalt theory in both our presentation and Product notebook. In our presentation, we ensured that each bullet point on each slide deals with a separate topic or sub-topic (Law of Proximity), and that the grouping and spacing of information on the slides was adequate and not too crowded. We aimed to make our plots/visualisations as aesthetically pleasing as possible and incorporated interactive elements to ensure space was used appropriately. We also used a template for our slides to make the presentation more visually appealing and eye-catching, and to create a sense of continuity. In our Product notebook, we used spacing to give each new topic or point of the investigation its own clear section, and we spaced each point of our discussion evenly. We also used headings of different sizes to mark sections and topics clearly, as well as an index on the side of the page that lets the user quickly see the topics.

## Presentation of uncertainty

*Explain how the information presentation and associated text were designed to enable the reader to understand the accuracy/uncertainty of the results.*

We used line graphs to highlight the missing data and uncertainty in our dataset. For example, missing entries in the data appear as gaps in the plot. We created graphs containing key information, including the number of days for which we had data in each month. Days were often missing data, which adds to the uncertainty and limits the accuracy of any conclusion drawn from the results. We also performed think-alouds with various other groups and even family members to make sure that the aim of our visualisations was coming across clearly, that aim being to highlight the uncertainties present in our data. The most direct way we ensured the reader understood the accuracy/uncertainty of the results was to state and discuss it explicitly in our analysis of results and in the limitations of our work. We also made this context of uncertainty clear in our presentation, and informed viewers that humans tend to seek out certainty, an inherent bias that likely contributes to why the media often portrays information as certain.
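
The count of days with data per month can be sketched as follows. This is a minimal illustration using made-up numbers, not our dataset, and the column and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up daily series: ten days spanning two months, with three missing reports.
dates = pd.date_range("2020-03-27", periods=10, freq="D")
new_cases = pd.Series(
    [5, np.nan, 8, 13, np.nan, 21, 34, np.nan, 55, 89],
    index=dates,
    name="new_cases",
)

# Group the daily values by calendar month.
by_month = new_cases.groupby(new_cases.index.to_period("M"))

coverage = pd.DataFrame(
    {
        "days_with_data": by_month.count(),  # count() ignores NaN entries
        "days_in_range": by_month.size(),    # size() counts every day
    }
)
coverage["days_missing"] = coverage["days_in_range"] - coverage["days_with_data"]
print(coverage)
```

Leaving the missing days as `NaN`, rather than filling or dropping them, is also what lets a line plot show them as gaps: Plotly's `go.Scatter` does not connect across missing values unless `connectgaps=True` is set.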


## Limitations

Data collection is a limitation because it creates uncertainty. Countries approach testing and reporting COVID-19 data very differently, and this affects how we interpret the results. Another limitation is the need to balance transforming the data to understand it against maintaining data integrity. We seek to communicate accurate information that has been altered as little as possible.


# Conclusion

Our aim was to uncover the deadliness of COVID-19, and to communicate uncertainty in our analysis. We found that COVID-19 deadliness, as measured by case fatality rate, appears to change over time and between countries. We then looked at country-specific testing, cases, and deaths for COVID-19 and found differences within these figures too. These discrepancies may reflect each country's ability to detect and report COVID-19 cases, tests, and deaths, and may account for some of the uncertainty associated with reporting the case fatality rate. Additional sources of uncertainty that we identified included missing values and the use of a seven-day rolling average.
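
To illustrate how missing values and a seven-day rolling average interact, the sketch below uses made-up numbers rather than our dataset. It assumes a crude case fatality rate defined as reported deaths divided by confirmed cases; the column names and the `min_periods=4` threshold are choices made for this example only.

```python
import numpy as np
import pandas as pd

# Made-up daily counts with one missing report on the third day.
dates = pd.date_range("2020-04-01", periods=10, freq="D")
df = pd.DataFrame(
    {
        "new_cases": [100, 120, np.nan, 150, 130, 160, 170, 180, 175, 190],
        "new_deaths": [2, 3, np.nan, 4, 3, 5, 4, 6, 5, 6],
    },
    index=dates,
)

# Crude case fatality rate over the whole period, in percent.
# sum() skips the missing day, so unreported counts are simply left out.
cfr_percent = df["new_deaths"].sum() / df["new_cases"].sum() * 100

# Strict seven-day mean: min_periods defaults to the window size, so any
# window that contains the missing day produces NaN.
strict = df.rolling(window=7).mean()

# Lenient seven-day mean: a window needs only four reported days, so the
# average is taken over whichever days are present.
lenient = df.rolling(window=7, min_periods=4).mean()

print(f"Crude CFR: {cfr_percent:.2f}%")
print("Smoothed days (strict): ", int(strict["new_cases"].notna().sum()))
print("Smoothed days (lenient):", int(lenient["new_cases"].notna().sum()))
```

A single missing day removes most of the strictly smoothed values, while the lenient version keeps them at the cost of averaging over fewer days. Either choice changes what the smoothed curve shows, which is why the rolling average is itself a source of uncertainty.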

# References

Hasell, J., Mathieu, E., Beltekian, D., et al. (2020). A cross-country database of COVID-19 testing. Scientific Data, 7, 345. https://doi.org/10.1038/s41597-020-00688-8

Rajgor, D. D., Lee, M. H., Archuleta, S., Bagdasarian, N., & Quek, S. C. (2020). The many estimates of the COVID-19 case fatality rate. The Lancet Infectious Diseases, 20(7), 776-777.