# Quick Wuhan Coronavirus insights 😷

This notebook started as a project to simulate the spread of a virus (using the coronavirus data) given data about the connections between different regions. So far I haven't found suitable data for that, so here I share some of the exploratory analysis I did along the way to get ideas.

In [None]:
import pandas as pd
import numpy as np
import json
import os
import re

from plotly import graph_objects as go
from plotly import express as px
from plotly.subplots import make_subplots

The main datasets I'm using are the [Coronavirus dataset](https://www.kaggle.com/brendaso/2019-coronavirus-dataset-01212020-01262020), a GeoJSON with the provinces of China, and a list of global flight routes.

With these, I try to find correlations between the interconnections of regions and the spread pattern of the virus.

In [None]:
coronadata = pd.read_csv('../input/2019-coronavirus-dataset-01212020-01262020/2019_nCoV_20200121_20200206.csv')
# Load the province boundaries; the context manager closes the file afterwards
with open('../input/china-geojson/china-provinces.geojson') as f:
    provinces_geojson = json.load(f)

In [None]:
coronadata.head(3)

(Some data cleaning/transformation)

In [None]:
# Group by Day
coronadata['Day'] = pd.to_datetime(coronadata['Last Update']).dt.date
# Drop wrong entries
coronadata = coronadata[~coronadata['Confirmed'].isna()]

regions_hist = coronadata.groupby(['Country/Region', 'Province/State', 'Day'])['Confirmed'].max()
data_china = pd.DataFrame(regions_hist['Mainland China']).reset_index()

# Data for making choropleth plots later
data_choro_china = pd.DataFrame(coronadata[coronadata['Country/Region'] == 'Mainland China'].groupby(['Province/State'])['Confirmed'].max()).reset_index()
data_choro_world = pd.DataFrame(coronadata.groupby(['Country/Region'])['Confirmed'].max()).reset_index()
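
To make the grouping step concrete, here is a small sketch on made-up rows (the numbers are hypothetical, not taken from the dataset). It assumes that `Confirmed` is a running total and that a province can report several times per day, in which case the daily `max` keeps the latest total for each day.

```python
import pandas as pd

# Hypothetical rows: two updates for Hubei on the same day, one the next day
toy = pd.DataFrame({
    'Province/State': ['Hubei', 'Hubei', 'Hubei'],
    'Last Update': ['2020-01-22 09:00', '2020-01-22 21:00', '2020-01-23 12:00'],
    'Confirmed': [400, 444, 549],
})
toy['Day'] = pd.to_datetime(toy['Last Update']).dt.date

# One row per province and day, keeping the highest count reported that day
daily = toy.groupby(['Province/State', 'Day'])['Confirmed'].max().reset_index()
print(daily)
```

The two updates from the first day collapse into a single row, so later plots get one point per province per day.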

### A general overview
First, let's look at the current situation in China (number of infected people per province).

In [None]:
fig = go.Figure()

fig.add_trace(go.Choropleth(geojson=provinces_geojson, locations=data_choro_china['Province/State'], z=data_choro_china['Confirmed'],
                                    featureidkey="properties.NAME_1",
                                    colorscale="inferno", zmin=0, zmax=2500,
                                    marker_opacity=0.9, 
                                    marker_line_width=1),
)

fig.update_layout(
    title_text = 'Number of infected people per province',
    showlegend = False,
    template='plotly_dark',
    geo = {'scope':'asia', 'showland':True},
    margin={"r":0,"t":40,"l":0,"b":10},
    height=700,
    #width=700,
)

fig.show()
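
One thing worth checking with a map like this: the choropleth matches each entry in `locations` against the GeoJSON property named by `featureidkey`, and as far as I know a province whose name has no match is simply left uncoloured. The sketch below uses a hypothetical, heavily trimmed GeoJSON and made-up name lists to show how mismatches could be spotted before plotting.

```python
# Hypothetical, heavily trimmed GeoJSON with only the property used for matching
toy_geojson = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'NAME_1': 'Hubei'}, 'geometry': None},
        {'type': 'Feature', 'properties': {'NAME_1': 'Nei Mongol'}, 'geometry': None},
    ],
}
# Hypothetical province names as they might appear in the case data
data_names = ['Hubei', 'Inner Mongolia']

# Names in the data that have no counterpart in the GeoJSON
geo_names = {f['properties']['NAME_1'] for f in toy_geojson['features']}
unmatched = sorted(set(data_names) - geo_names)
print(unmatched)
```

With the real data, the same check would use `provinces_geojson` and `data_choro_china['Province/State']`.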

And the same at global (world) level.

In [None]:
fig = go.Figure()
fig.add_trace(
            go.Choropleth(
                locations = data_choro_world['Country/Region'],
                z = (data_choro_world['Confirmed']),
                text = data_choro_world['Country/Region'],
                colorscale = 'inferno',
                locationmode = 'country names',
                marker_line_color='darkgray',
                marker_line_width=0.5,
                colorbar_title = 'Infected (limited to max. 100)',
                zmax=100,
                zmin=-1
            )
)

fig.update_layout(
    title_text = 'Number of infected people per country',
    showlegend = False,
    template='plotly_dark',
    geo = {'scope':'world', 'showland':True},
    height=600,
    #width=1100,
)

fig.show()

Note, in the first map, how the provinces bordering Hubei tend to be more affected by the epidemic.

At the global level the same relationship is less clear. I come back to this later, in the flight data section.

### How has the virus grown in each province?

The next thing to look at is how rapidly the virus has spread in China.

In [None]:
fig = px.line(data_china, x="Day", y="Confirmed", title="Growth of Coronavirus in China's provinces", color='Province/State')

fig.update_layout(
    height=700,
    #width=1100,
    margin={"r":0,"t":30,"l":0,"b":5},
)

fig.show()

The chart shows that the virus has been relatively well contained: the growth in the other provinces is very small compared to Hubei (Wuhan's province).

Let's take a look at the mortality rate.

In [None]:
death_data = coronadata[coronadata['Country/Region'] == 'Mainland China'].groupby('Province/State')[['Death', 'Recovered', 'Confirmed']].max().reset_index()
death_ratios = pd.DataFrame({"Province":death_data['Province/State'], "DeathRatio":death_data['Death']*100/death_data['Confirmed']}).sort_values('DeathRatio', ascending=False).reset_index(drop=True)
fig = px.bar(death_ratios, x='Province', y='DeathRatio', color='DeathRatio', labels={'DeathRatio':'Death Ratio'}, title='Death ratio per province (percentage)')
fig.show()

Contrary to the alarmism spread by the mass media, we can see that even in the worst province (Hubei) the ratio of deaths to confirmed cases is below 3%. As a reference, Ebola (50%), HIV (80%) and polio (22%) are far deadlier.

You can see more mortality rates [here](https://docs.google.com/spreadsheets/d/1kHCEWY-d9HXlWrft9jjRQ2xf6WHQlmwyrXel6wjxkW8/edit#gid=0).
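
For reference, the death ratio plotted above is a snapshot: deaths to date divided by confirmed cases to date, times 100. A minimal sketch with hypothetical totals (provinces `A`, `B`, `C` and their numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical totals per province (not real figures)
toy = pd.DataFrame({
    'Province/State': ['A', 'B', 'C'],
    'Confirmed': [1000, 200, 50],
    'Death': [25, 2, 0],
})

# Deaths per 100 confirmed cases
toy['DeathRatio'] = toy['Death'] * 100 / toy['Confirmed']
print(toy.sort_values('DeathRatio', ascending=False))
```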

**How fast is the number of cases growing?**

In [None]:
fig = go.Figure()

provinces = data_china['Province/State'].unique().tolist()
# Provinces shown by default; the rest can be toggled from the legend
important_provinces = {'Beijing', 'Hubei', 'Henan', 'Hebei', 'Shanghai', 'Fujian'}
for province in provinces:
    dates = data_china[data_china['Province/State'] == province]['Day'].apply(str)
    data = data_china[data_china['Province/State'] == province]['Confirmed'].pct_change()*100
    
    visible = 'legendonly'
    if province in important_provinces:
        visible = True

    fig.add_trace(go.Scatter(x=dates, y=data, name=province, line_shape='spline', visible=visible))

fig.update_layout(height=600,
                  title_text="Increase rate of the virus each day per province",
                  yaxis={'title':'Percentage change', 'ticksuffix': "%"},
                  xaxis={'title':'Day'}
)
fig.show()

As the data suggests, we should not be alarmed about this. The virus seems fairly harmless, as its mortality rate is low and the pace of new cases is decreasing.

I would bet that most of the deaths were among elderly people or people with pre-existing health problems.
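
As a small illustration of what `pct_change` computes in the chart above, here is a sketch on a hypothetical series of cumulative counts. Each value is the change relative to the previous day, so the first day has no value.

```python
import pandas as pd

# Hypothetical cumulative confirmed cases over four days
confirmed = pd.Series([100, 150, 180, 189])

# Percentage change relative to the previous day; the first entry is NaN
growth = confirmed.pct_change() * 100
print(growth.round(1).tolist())
```

A falling percentage here means the total is still growing, only more slowly than the day before.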

### Adding some flight data ✈️

The next thing to do is to get data that the spread can be correlated with. I think the most obvious candidate is **flight data**.

Here, I'm getting the OpenFlights data, which contains lists of world airports and routes. Unfortunately, it's a little bit outdated (2014), but I'm going to assume that the current flight routes are similar to the ones in this dataset.

In [None]:
if 'airports.dat' not in os.listdir('.'):
    !wget https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat &&\
        wget https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

In [None]:
routes = pd.read_csv('routes.dat', header=None)
routes.columns = ['Airline', 'AirlineID', 'SourceAirport', 'SourceAirportID', 'DestAirport', 'DestAirportID', 'Codeshare', 'Stops', 'Equipment']
routes = routes[(routes['SourceAirportID'] != '\\N') & (routes['DestAirportID'] != '\\N')]
routes['SourceAirportID'] = routes['SourceAirportID'].astype(int)
routes['DestAirportID'] = routes['DestAirportID'].astype(int)

airports = pd.read_csv('airports.dat', header=None)
airports.columns = ['ID', 'Name', 'City', 'Country', 'IATA', 'ICAO', 'LAT', 'LON', 'ALT', 'TIMEZONE', 'DST', 'TZ', 'TYPE', 'SOURCE']
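
An alternative to filtering the `\N` strings by hand is to let pandas treat them as missing values while reading. The sketch below uses two made-up rows in the `routes.dat` column layout (the second row and its codes are invented) and the standard `na_values` parameter of `read_csv`.

```python
import io
import pandas as pd

# Made-up rows in the routes.dat layout; the second one has unknown IDs
sample = io.StringIO(
    "2B,410,AER,2965,KZN,2990,,0,CR2\n"
    "XX,\\N,AAA,\\N,BBB,\\N,,0,320\n"
)
cols = ['Airline', 'AirlineID', 'SourceAirport', 'SourceAirportID',
        'DestAirport', 'DestAirportID', 'Codeshare', 'Stops', 'Equipment']

# Treat the literal \N marker as missing, drop incomplete routes, then cast to int
toy_routes = (
    pd.read_csv(sample, header=None, names=cols, na_values=['\\N'])
    .dropna(subset=['SourceAirportID', 'DestAirportID'])
    .astype({'SourceAirportID': int, 'DestAirportID': int})
)
print(toy_routes[['SourceAirport', 'SourceAirportID', 'DestAirport', 'DestAirportID']])
```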

I'm also going to add extra data to the Chinese airports to indicate which province each one is in. I scrape it from Wikipedia (https://en.wikipedia.org/wiki/List_of_airports_in_China).

In [None]:
# Imported outside the `if` because a later cell uses them as well
import requests
import bs4

if 'airports_province.csv' not in os.listdir('.'):
    webpage = requests.get('https://en.wikipedia.org/wiki/List_of_airports_in_China').text
    bs = bs4.BeautifulSoup(webpage, 'html.parser')

    # First table with class "wikitable"
    table = bs.select('table.wikitable')[0].select('tbody')[0]

    rows = []
    current_province = None
    for c in table.select('tr')[1:]:  # skip the table header
        if c.has_attr('class'):  # section header row (province name)
            current_province = c.find('a').text
        elif current_province is not None:  # airport row
            airport_name = c.find_all('a')[1]['title']
            icao = c.find_all('td')[1].text.strip()
            iata = c.find_all('td')[2].text.strip()
            rows.append((current_province, airport_name, icao, iata))
    airports_province = pd.DataFrame(rows, columns=['Province', 'Airport', 'ICAO', 'IATA'])
    # Drop airports under construction (no ICAO code) before caching the result
    airports_province = airports_province[airports_province['ICAO'] != '']
    airports_province.to_csv('airports_province.csv', index=False)
else:
    airports_province = pd.read_csv('airports_province.csv')

In [None]:
china_airports = airports[airports['Country'] == 'China'].loc[:, ['ID', 'Country','Name', 'City', 'ICAO', 'LAT', 'LON']]

china_routes = pd.merge(left=china_airports, right=routes, left_on='ID', right_on='SourceAirportID')
china_routes = pd.merge(left=china_routes, right=airports, left_on='DestAirportID', right_on='ID', suffixes=('', '_dest'))
china_routes = china_routes[['ID', 'Country', 'Name', 'City', 'ICAO', 'LAT', 'LON', 'SourceAirportID', 'DestAirportID', 'ID_dest', 'Name_dest','Country_dest','City_dest', 'ICAO_dest', 'LAT_dest', 'LON_dest', 'Equipment']]
china_routes = pd.merge(left=china_routes, right=airports_province, on='ICAO')[['ID', 'Country', 'Name', 'City', 'Province', 'ICAO', 'LAT', 'LON', 'SourceAirportID', 'ID_dest', 'Name_dest','Country_dest','City_dest', 'ICAO_dest', 'LAT_dest', 'LON_dest', 'Equipment']]

This DataFrame contains all the routes departing from China (to anywhere in the world).

In [None]:
china_routes.head(3)

In [None]:
top_destinations_world = pd.DataFrame(china_routes[china_routes['Country_dest'] != 'China']\
    .groupby('Country_dest')['Country_dest'].count().sort_values(ascending=False)).rename(columns={"Country_dest": "count"})
top_destinations_world = top_destinations_world.reset_index()

top_destinations_china = pd.DataFrame(china_routes[china_routes['Country_dest'] == 'China']\
    .groupby('Province')['Province'].count().sort_values(ascending=False)).rename(columns={"Province": "count"})
top_destinations_china = top_destinations_china.reset_index()
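
The same kind of count table can also be built with `value_counts`, which is a bit shorter than the group/count/rename chain. A sketch on a hypothetical list of destination countries (one row per route):

```python
import pandas as pd

# Hypothetical destination countries, one row per route
toy = pd.DataFrame({
    'Country_dest': ['Japan', 'Thailand', 'Japan', 'South Korea', 'Japan', 'Thailand']
})

counts = (
    toy['Country_dest']
    .value_counts()                # sorted from most to least frequent
    .rename_axis('Country_dest')   # name the index so it becomes a proper column
    .reset_index(name='count')
)
print(counts)
```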

In [None]:
fig = make_subplots(rows=1, cols=2,
                    horizontal_spacing=.25,
                    subplot_titles=("International (Countries)", "National (provinces)"))

fig.add_trace(
    go.Bar(y=top_destinations_world['Country_dest'].to_list(), x=top_destinations_world['count'].to_list(), orientation='h'),
    row=1, col=1
)

fig.add_trace(
    go.Bar(y=top_destinations_china['Province'].to_list(), x=top_destinations_china['count'].to_list(), orientation='h'),
    row=1, col=2
)


fig.update_yaxes(title_text="Destinations", categoryorder = "total ascending", row=1, col=1)
fig.update_yaxes(categoryorder = "total ascending", row=1, col=2)

fig.update_xaxes(title_text="Number of flights", row=1, col=1)
fig.update_xaxes(title_text="Number of flights", row=1, col=2)

fig.update_layout(height=700, title_text="Top destinations of China flights", showlegend=False, 
                  margin={"r":0,"l":10, 'pad':0})
fig.show()

Let's also plot the routes departing from Wuhan and compare them with the number of infected people (in each province)

In [None]:
wuhan_routes = china_routes[(china_routes['Country_dest'] == 'China') & (china_routes['City'] == "Wuhan")]
wuhan_routes = pd.merge(left=wuhan_routes, right=airports_province, left_on='ICAO_dest', right_on='ICAO')[['ICAO_x', 'LAT', 'LON','ICAO_dest', 'Province_y', 'LAT_dest', 'LON_dest']]
#wuhan_routes.groupby('Province_y')['Province_y'].count().sort_values(ascending=False)

fig = go.Figure()


fig.add_trace(go.Choropleth(geojson=provinces_geojson, locations=data_choro_china['Province/State'], z=data_choro_china['Confirmed'],
                                    featureidkey="properties.NAME_1",
                                    colorscale="inferno", zmin=0, zmax=2500,
                                    marker_opacity=0.8, 
                                    marker_line_width=0.01),

)

flight_routes = []
for i in range(len(wuhan_routes)):
    fig.add_trace(
        go.Scattergeo(
            locationmode = 'ISO-3',
            lon = [wuhan_routes['LON'][i], wuhan_routes['LON_dest'][i]],
            lat = [wuhan_routes['LAT'][i], wuhan_routes['LAT_dest'][i]],
            hoverinfo = 'none',
            mode = 'lines',
            line = dict(width = .7,color = 'red'),
            opacity=.2
            #opacity = float(df_flight_paths['cnt'][i]) / float(df_flight_paths['cnt'].max()),
        )
    )

#airports
fig.add_trace(go.Scattergeo(
    locationmode = 'ISO-3',
    lon = china_airports['LON'],
    lat = china_airports['LAT'],
    hoverinfo = 'text',
    text = china_airports['Name'],
    mode = 'markers',
    marker = dict(
        size = 3,
        color = 'rgb(255, 0, 0)',
        line = dict(
            width = 1,
            color = 'rgba(68, 68, 68, 0)'
        )
    )))

fig.update_layout(
    title_text = 'Number of infected people + Flights routes from Wuhan',
    showlegend = False,
    template='plotly_dark',
    geo = {'scope':'asia', 'showland':True},
    #margin={"r":0,"t":0,"l":0,"b":0},
    height=700,
    #width=1100,
)

fig.show()

There doesn't seem to be a clear pattern between domestic flights and the virus. However, the areas near Hubei are the most affected. I suppose this spread is better described by car/bus/train data, as the distances are "short" and most people use those means of transport.
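
To put a number on "near", one option is the great-circle distance between Wuhan and each destination airport, computed from the `LAT`/`LON` columns. Below is a minimal sketch of the haversine formula; the function name `haversine_km` and the rounded coordinates are my own illustration, not part of the dataset.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Approximate airport coordinates, rounded for illustration
wuhan = (30.78, 114.21)
beijing = (40.08, 116.58)
distance = haversine_km(*wuhan, *beijing)
print(round(distance))
```

Applied row by row to `wuhan_routes`, this would give a distance that could be compared against the confirmed cases of each destination province.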

Let's compare with the rest of the world.

In [None]:
fig = go.Figure()

routes_wuhan_world = china_routes[(china_routes['Country_dest'] != "China") & (china_routes['ICAO'] == 'ZHHH')]

flight_routes = []
for i in range(len(routes_wuhan_world)):
    fig.add_trace(
        go.Scattergeo(
            locationmode = 'ISO-3',
            lon = [routes_wuhan_world['LON'].iloc[i], routes_wuhan_world['LON_dest'].iloc[i]],
            lat = [routes_wuhan_world['LAT'].iloc[i], routes_wuhan_world['LAT_dest'].iloc[i]],
            hoverinfo = 'none',
            mode = 'lines',
            line = dict(width = .7,color = 'red'),
            opacity=.5
            #opacity = float(df_flight_paths['cnt'][i]) / float(df_flight_paths['cnt'].max()),
        )
    )

#airports
fig.add_trace(go.Scattergeo(
    locationmode = 'ISO-3',
    lon = routes_wuhan_world['LON_dest'],
    lat = routes_wuhan_world['LAT_dest'],
    hoverinfo = 'text',
    text = routes_wuhan_world['Name_dest'],
    mode = 'markers',
    marker = dict(
        size = 3,
        color = 'rgb(255, 0, 0)',
        line = dict(
            width = 1,
            color = 'rgba(68, 68, 68, 0)'
        )
    )))




fig.update_layout(
    title_text = 'Routes from Wuhan to the rest of the world',
    showlegend = False,
    template='plotly_dark',
    geo = {'scope':'world', 'showland':True},
    #margin={"r":0,"t":0,"l":0,"b":0},
    height=700,
    #width=1100,
)

fig.show()

And now the routes from all of China to the rest of the world, together with the confirmed cases in each country.

In [None]:
routes_china_world = china_routes[china_routes['Country_dest'] != "China"]

fig = go.Figure()

fig.add_trace(go.Choropleth(
    locations = data_choro_world['Country/Region'],
    z = (data_choro_world['Confirmed']),
    text = data_choro_world['Country/Region'],
    colorscale = 'inferno',
    locationmode = 'country names',
    marker_line_color='darkgray',
    marker_line_width=0.5,
    colorbar_title = 'Infected (limited to max. 100)',
    zmax=100,
    zmin=-1
))


flight_routes = []
for i in range(len(routes_china_world)):
    fig.add_trace(
        go.Scattergeo(
            locationmode = 'ISO-3',
            lon = [routes_china_world['LON'].iloc[i], routes_china_world['LON_dest'].iloc[i]],
            lat = [routes_china_world['LAT'].iloc[i], routes_china_world['LAT_dest'].iloc[i]],
            hoverinfo = 'none',
            mode = 'lines',
            line = dict(width = .5,color = 'red'),
            opacity=.1
        )
    )

#airports
fig.add_trace(go.Scattergeo(
    locationmode = 'ISO-3',
    lon = routes_china_world['LON_dest'],
    lat = routes_china_world['LAT_dest'],
    hoverinfo = 'text',
    text = routes_china_world['Name_dest'],
    mode = 'markers',
    marker = dict(
        size = 3,
        color = 'rgb(255, 0, 0)',
        line = dict(
            width = 1,
            color = 'rgba(68, 68, 68, 0)'
        )
    )))




fig.update_layout(
    title_text = 'Routes from China to the rest of the world + Infections on each country',
    showlegend = False,
    template='plotly_dark',
    geo = {'scope':'world', 'showland':True},
    #margin={"r":0,"t":0,"l":0,"b":0},
    height=700,
    #width=1100,
)

fig.show()

### Caveats

- At a local scale, flights don't seem very promising for explaining the spread of the virus. Train, car or bus data might show better patterns.
- At a global scale, flight route data shows a pretty clear pattern: the countries with the densest route connections to China tend to have more confirmed cases.
- As the data shows, the virus doesn't seem to be that dangerous, at least not as much as the media says (no need to be afraid ;)).
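
The global pattern mentioned above could be checked numerically with a rank correlation between the number of routes to a country and its confirmed cases. A sketch on hypothetical per-country numbers (countries `A` to `E` are invented); the Spearman coefficient is computed here as the Pearson correlation of the ranks, which avoids needing anything beyond pandas.

```python
import pandas as pd

# Hypothetical per-country numbers (not real data)
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D', 'E'],
    'routes': [120, 80, 40, 10, 5],
    'confirmed': [25, 30, 8, 2, 0],
})

# Spearman correlation = Pearson correlation of the ranked values
rho = toy['routes'].rank().corr(toy['confirmed'].rank())
print(round(rho, 2))
```

With the real data, `routes` would come from `top_destinations_world['count']` and `confirmed` from `data_choro_world`, joined on the country name.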

### Next steps

I think it could be useful to merge in data about short-distance transport in China (train, bus...).

I managed to scrape the paths of the country's main railways, but I couldn't find information about the number of people travelling on each of them. With that, simple statistical models such as naive Bayes could be used to estimate the probability of the virus spreading to neighbouring cities/provinces.
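
To show why passenger volumes matter, here is a deliberately simple sketch. It is not naive Bayes: it only assumes that each traveller is infected independently with the same probability, so the chance that at least one infected person arrives is `1 - (1 - p)^n`. The city names, volumes and prevalence are all hypothetical.

```python
def arrival_probability(travellers, prevalence):
    """Probability that at least one of `travellers` independent passengers is infected."""
    return 1 - (1 - prevalence) ** travellers

# Hypothetical daily passenger volumes from an origin city to its neighbours
daily_travellers = {'City A': 5000, 'City B': 500, 'City C': 50}
prevalence = 0.001  # assumed share of infected people among travellers

for city, n in daily_travellers.items():
    print(city, round(arrival_probability(n, prevalence), 3))
```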

I would appreciate ideas about how to get such data :)

In [None]:
STATIONS_URL = 'https://www.travelchinaguide.com/china-trains/stations-list.htm'
RAILWAY_URL = 'https://www.travelchinaguide.com/china-trains/railway/network.htm'

# Imported here as well so this cell runs on its own
import requests
import bs4

req = requests.get(RAILWAY_URL).text
soup = bs4.BeautifulSoup(req, 'html.parser')
tables = soup.select('.table1')[:-1]
railway_lines = [t.select('tr')[1].select_one('td').text for t in tables]
railway_lines = [re.split('-|, ', l) for l in railway_lines]

In [None]:
railway_lines