# Interactive visualizations of coronavirus data using Altair

In this notebook we'll create some interactive graphs to gain insight into the COVID-19 dataset. If you're looking for forecasts, check out these other notebooks:
* [COVID19 | How many Brazilians will be infected?](https://www.kaggle.com/franlopezguzman/how-many-brazilians-will-be-infected)
* COVID19 | Simple regressor using timedeltas (Coming soon)

Also, here are some useful resources:
* WHO reports on COVID-19: 
    https://www.who.int/emergencies/diseases/novel-coronavirus-2019/events-as-they-happen
* Dataset from the Johns Hopkins University Center for Systems Science and Engineering: 
    https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
* Altair:
    https://altair-viz.github.io/

**If you find this notebook useful or interesting, please upvote it. Thanks!**

## Table of contents

* [Preprocessing data](#preprocessing)
* [How is coronavirus evolving in different countries?](#world)
* [Coronavirus in the US](#us)

<a id=preprocessing></a>
## Preprocessing data

To begin with, we need to clean the data and prepare it for our visualizations. Since we're not interested in predictions but merely exploratory analysis, we can get rid of the states column for now.

In [None]:
import numpy as np
import pandas as pd
import altair as alt

In [None]:
df = pd.read_csv('../input/covid19-global-forecasting-week-1/train.csv', parse_dates=['Date'], index_col='Id')
df.head()

In [None]:
df.rename(columns={'Date': 'date',
                     'Province/State':'state',
                     'Country/Region':'country',
                     'Lat':'lat',
                     'Long': 'long',
                     'ConfirmedCases': 'confirmed',
                     'Fatalities':'deaths',
                    }, inplace=True)

In [None]:
# Sum the counts over states; lat/long are left out because adding coordinates is meaningless
df_countries = df.groupby(['country','date'])[['confirmed','deaths']].sum().reset_index()
df_countries.head()
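
To make the aggregation step above concrete, here is a minimal sketch on a tiny made-up frame. The numbers are invented for illustration and are not taken from the dataset; the sketch only assumes the renamed columns `country`, `state`, `date`, `confirmed` and `deaths`. It shows how summing over states leaves one row per country and date.

```python
import pandas as pd

# Toy data: invented numbers, two states of one country over two days.
toy = pd.DataFrame({
    'country': ['US', 'US', 'US', 'US'],
    'state': ['New York', 'New Jersey', 'New York', 'New Jersey'],
    'date': pd.to_datetime(['2020-03-20', '2020-03-20', '2020-03-21', '2020-03-21']),
    'confirmed': [10, 4, 15, 7],
    'deaths': [1, 0, 2, 1],
})

# Sum only the count columns so that each (country, date) pair becomes one row.
toy_countries = (
    toy.groupby(['country', 'date'])[['confirmed', 'deaths']]
       .sum()
       .reset_index()
)
print(toy_countries)
```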

<a id=world></a>
## How is coronavirus evolving in different countries?

COVID19 emerged in China and stayed there for quite some time. It took weeks to spread to neighbouring countries like South Korea, and later to Europe (currently the epicenter) and the rest of the world. We'll have a look at different countries where COVID19 has had a considerable effect: China, South Korea, Italy, Iran, Spain, USA, and Brazil.

From the following plots we can see that China has stabilized both the number of confirmed cases and the number of deaths. South Korea and Iran initially looked like potential problem spots, but they managed to control the spread of the virus. Italy and Spain, on the other hand, are still showing exponential growth in confirmed cases even though both countries are under full lockdown. What's worse, their citizens are dying at a higher rate. Why could that be?

The number of infected people in the US is growing quickly; in a few days it will surely surpass Italy and China. However, the mortality rates in the US are rather low. Maybe the virus hasn't had enough time to kill the infected, or maybe the US health system is better prepared. We'll see in the following days.

We included Brazil for two reasons. First, Brazil is a huge country, similar to the US in population and in political leadership, so it will be interesting to see whether Brazil follows the US numbers. Second, Brazil's health system is known to be worse than that of first world countries: will Brazil be able to handle this pandemic better than Italy or Spain?

In [None]:
#df_countries['country'].unique()
countries_list = ['China', 'Korea, South', 'Italy', 'Iran', 'Spain', 'US',  'Brazil']

df_countries[df_countries['country'].isin(countries_list)].sort_values('date', ascending=False).head()

In [None]:
base = alt.Chart(
    df_countries[df_countries['country'].isin(countries_list)]
).mark_area(
    interpolate = 'monotone',
    fillOpacity = .8
).encode(
    x = 'date',
    y = 'confirmed',
    color = 'country'
)

base.encode(y='confirmed') & base.encode(y='deaths')
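
The comparison of death rates above can be made explicit with a naive ratio of cumulative deaths to cumulative confirmed cases. The sketch below uses invented numbers for two hypothetical countries, and `death_ratio_pct` is a column name introduced here for illustration. Treat the ratio as a rough indicator only: deaths lag behind confirmations, and testing coverage differs between countries.

```python
import pandas as pd

# Toy cumulative totals for two hypothetical countries (invented numbers).
toy = pd.DataFrame({
    'country': ['A', 'B'],
    'confirmed': [1000, 2000],
    'deaths': [50, 20],
})

# Naive ratio of cumulative deaths to cumulative confirmed cases, in percent.
toy['death_ratio_pct'] = 100 * toy['deaths'] / toy['confirmed']
print(toy)
```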

The plot below is an interactive scatterplot with a linked bar chart. Drag horizontally over the scatterplot to select any period and see the confirmed cases and deaths per country for that period.

In [None]:
interval = alt.selection_interval(encodings=['x'])
color = alt.condition(interval, 'country', alt.value('lightgray'))

point_base = alt.Chart(
    df_countries[df_countries['country'].isin(countries_list)]
).mark_point().encode(
    x = 'date',
    color = color
).properties(
    selection = interval
)

hist_base = alt.Chart(
    df_countries[df_countries['country'].isin(countries_list)]
).mark_bar().encode(
    y = 'country',
    color = 'country'
).transform_filter(interval)


point_confirmed = point_base.encode(y='confirmed')
point_deaths = point_base.encode(y='deaths')

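# Counts are cumulative, so take the maximum in the selected period rather than stacking every day
hist_confirmed = hist_base.encode(x='max(confirmed)')
hist_deaths = hist_base.encode(x='max(deaths)')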

(point_confirmed & hist_confirmed) & (point_deaths & hist_deaths)
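
As a non-interactive counterpart to the brush above, the following sketch shows one way to summarise a chosen period in plain pandas. It is not what Altair does internally; the data, the `start` and `end` dates and the variable names are all hypothetical. Because the counts are cumulative, it takes the largest value inside the window as the total reached by the end of the period.

```python
import pandas as pd

# Toy cumulative counts (invented numbers) for two hypothetical countries.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03'] * 2),
    'confirmed': [1, 3, 6, 0, 2, 5],
})

# Hypothetical selected period, playing the role of the brushed x-range.
start, end = pd.Timestamp('2020-03-01'), pd.Timestamp('2020-03-02')
in_window = toy[toy['date'].between(start, end)]

# Counts are cumulative, so the largest value in the window is the total reached by its end.
window_totals = in_window.groupby('country')['confirmed'].max()
print(window_totals)
```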

The plot below shows very clearly how China was able to control the spread of the virus. We can also see how South Korea, Italy and Iran all had their first cases around the same time, but the evolution since has been very different. Kudos to South Korea and Iran for controlling the coronavirus rapidly!

In [None]:
sort_order = ['China', 'Korea, South', 'Italy', 'Iran', 'Spain', 'US', 'Brazil']

step = 50
overlap = 3

alt.Chart(
    df_countries[df_countries['country'].isin(countries_list)], 
    height=step,
    width = 12*step,
).mark_area(
    interpolate = 'monotone',
    fillOpacity = .8,
).encode(
    alt.X('date'),
    alt.Y('confirmed:Q',
          scale=alt.Scale(range=[step, -step * overlap]),
          axis=None),
    alt.Fill('country'),
    tooltip = ['date','country','confirmed','deaths']
).facet(
    row = alt.Row('country:N',
                  sort=sort_order, #alt.EncodingSortField(field='confirmed', order='descending'),
                  title=None,
                  header=alt.Header(labelAngle=0, labelAlign='right'))
).properties(
    bounds = 'flush',
    title = 'Evolution of confirmed cases per country'
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
)

In [None]:
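# sort_values returns a new frame, so assign it back; diff() below relies on date order within each country
df_countries = df_countries.sort_values(by=['country','date'])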
df_countries['daily_confirmed'] = df_countries.groupby('country')['confirmed'].diff().fillna(0)
df_countries['daily_deaths'] = df_countries.groupby('country')['deaths'].diff().fillna(0)
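
To see what the `diff` step above does, here is a small sketch on invented numbers for two hypothetical countries. It turns cumulative counts into daily increments and shows why the rows need to be sorted by date within each country first.

```python
import pandas as pd

# Toy cumulative counts (invented numbers), deliberately out of date order.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2020-03-03', '2020-03-01', '2020-03-02',
                            '2020-03-01', '2020-03-02', '2020-03-03']),
    'confirmed': [6, 1, 3, 0, 2, 5],
})

# Sort first: diff() subtracts the previous row within each group, so row order matters.
toy = toy.sort_values(['country', 'date']).reset_index(drop=True)

# The first day of each country has no previous row, so its NaN is replaced with 0.
toy['daily_confirmed'] = toy.groupby('country')['confirmed'].diff().fillna(0)
print(toy)
```

Note that filling the first day with 0 means any cases already present on that day are not counted as new.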

We can also look at the number of new confirmed cases per day in each country. Notice how China has fully controlled the virus, and also how the US currently has the highest number of daily confirmed cases: over 10k new cases every day!

In [None]:
sort_order = ['China', 'Korea, South', 'Italy', 'Iran', 'Spain', 'US', 'Brazil']

step = 50
overlap = 3

alt.Chart(
    df_countries[df_countries['country'].isin(countries_list)], 
    height=step,
    width = 12*step,
).mark_area(
    interpolate = 'monotone',
    fillOpacity = .8,
).encode(
    alt.X('date'),
    alt.Y('daily_confirmed:Q',
          scale=alt.Scale(range=[step, -step * overlap]),
          axis=None),
    alt.Fill('country'),
    tooltip = ['date','country','daily_confirmed','daily_deaths']
).facet(
    row = alt.Row('country:N',
                  sort=sort_order, #alt.EncodingSortField(field='confirmed', order='descending'),
                  title=None,
                  header=alt.Header(labelAngle=0, labelAlign='right'))
).properties(
    bounds = 'flush',
    title = 'Evolution of daily confirmed cases per country',
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
)

<a id=us></a>
## Coronavirus in the US: A Closer Look

Coronavirus is growing very fast in the US. If you've watched TV or read the news lately you probably think most cases come from New York. Right? Well, the data shows that only about half of the infections are from NY. Which other states are suffering from the epidemic the most?

In [None]:
us = df[df['country']=='US'].groupby(['state','date']).sum().reset_index()
us = us[us['date']>='2020-03-07']
relevant_states = us.sort_values('confirmed', ascending=False)['state'].unique()[:20].tolist()
relevant_states.remove('New York')

us['daily_confirmed'] = us.groupby('state')['confirmed'].diff().fillna(0)
us['daily_deaths'] = us.groupby('state')['deaths'].diff().fillna(0)

us['is_NY'] = us['state'] == 'New York'

alt.Chart(
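    # Sum only the count columns; summing the text column 'state' would be meaningless
    us.groupby(['date', 'is_NY'])[['confirmed', 'deaths']].sum().reset_index()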
).mark_area(
    interpolate = 'monotone',
    fillOpacity = .8,
).encode(
    x = 'date',
    y = 'confirmed',
    color = 'is_NY:N',
    tooltip = ['date','confirmed']
).properties(
    title='Confirmed cases in NY and rest of USA',
)
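
The claim about New York's share can be checked numerically as well as visually. The sketch below uses invented numbers, and the names `ny`, `total` and `ny_share_pct` are introduced here for illustration. It computes the percentage of confirmed cases that come from New York on each date.

```python
import pandas as pd

# Toy state-level cumulative counts (invented numbers).
toy = pd.DataFrame({
    'state': ['New York', 'New Jersey', 'California'] * 2,
    'date': pd.to_datetime(['2020-03-20'] * 3 + ['2020-03-21'] * 3),
    'confirmed': [50, 30, 20, 120, 50, 30],
})
toy['is_NY'] = toy['state'] == 'New York'

# Confirmed cases per date: New York only, and all states together.
ny = toy[toy['is_NY']].groupby('date')['confirmed'].sum()
total = toy.groupby('date')['confirmed'].sum()

# New York's share of all confirmed cases on each date, in percent.
ny_share_pct = 100 * ny / total
print(ny_share_pct)
```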

In [None]:
us.drop('is_NY', axis=1, inplace=True)

step = 50
overlap = 3

alt.Chart(
    us[us['state'].isin(relevant_states)], 
    height=step,
    width = 12*step,
).mark_area(
    interpolate = 'monotone',
    fillOpacity = .8,
).encode(
    alt.X('date'),
    alt.Y('confirmed:Q',
          scale=alt.Scale(range=[step, -step * overlap]),
          axis=None),
    alt.Fill('state', legend=None),
    tooltip = ['date','state','confirmed','deaths']
).facet(
    row = alt.Row('state:N',
                  sort=relevant_states,
                  title=None,
                  header=alt.Header(labelAngle=0, labelAlign='right'))
).properties(
    bounds = 'flush',
    title = 'Evolution of confirmed cases per state'
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
)

Wow. New Jersey is growing **fast**. California and Washington seemed like they were in trouble at first, but as of today they've got around 2500 confirmed cases each, whereas NJ is closing in on 4k. Also, notice how different states show different behaviours. Some are growing slowly and others show what looks like exponential growth. Keep an eye on those!

## Now what?

If you liked this notebook, please upvote it. You may also like [this notebook](https://www.kaggle.com/franlopezguzman/covid19-how-many-brazilians-will-be-infected/) where we'll forecast the impact of coronavirus in Brazil.


Also: **STAY AT HOME AND WASH YOUR HANDS**.