This notebook is my practice run at EDA (exploratory data analysis), the first step in data science.

I used to follow the daily Covid data when Covid hit Indonesia back in March and April. It has been a while, so let's have a look at the data now.

Data will be scraped from:
* https://github.com/CSSEGISandData/COVID-19
* Our World in Data (for Indonesia): https://github.com/owid/covid-19-data/tree/master/public/data
* Covid data for Indonesia from KawalCOVID: http://sinta.ristekbrin.go.id/covid/datasets
* GEOjson for Indonesia: https://bitbucket.org/rifani/geojson-political-indonesia/src/master/

Summary of this notebook:
* Download Covid19 data for worldwide and Indonesia from several sources
* Download GEOjson map for Indonesia
* Worldwide: summary of latest data, worldwide map, death vs confirmed comparison
* Indonesia: summary of latest data, summary plots (total cases, daily cases, positive rate and mortality rate) and other random stats that I am interested in

New skills I picked up and applied in this notebook:
* Using Plotly Express
* Extracting data from Google Sheet API
* Cleaning data. The spreadsheet is messy: tables are stacked on top of other tables in the same spreadsheet tab
* Extracted data comes back as strings. I am not sure if there is a way to extract it in a numeric format instead of converting it to float manually. Next time, maybe there is a way to download the Google Sheet automatically and just use pd.read_csv()
* Working with GEOjson data format and plotting an interactive map
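
The string-to-number conversion mentioned in the list above could be sketched roughly as follows. This is a minimal illustration, not the code used later in this notebook: the sample values, the column names and the helper `to_number` are all made up for the example.

```python
import io
import pandas as pd

# Hypothetical sample mimicking values that arrive as strings from a spreadsheet
raw = pd.DataFrame({
    "cases": ["1,234", "56", "", "7,890"],
    "positive_rate": ["12.5%", "8%", "n/a", "10.1%"],
})

def to_number(series):
    """Strip thousands separators and percent signs, then convert to float."""
    cleaned = (series.str.replace(",", "", regex=False)
                     .str.replace("%", "", regex=False))
    # errors="coerce" turns anything unparseable (empty cells, "n/a") into NaN
    return pd.to_numeric(cleaned, errors="coerce")

clean = raw.apply(to_number)
print(clean.dtypes)

# If the sheet is available as CSV text, read_csv can handle the
# thousands separators while parsing (values here are made up)
csv_text = 'date,cases\n2020-01-01,"1,000"\n2020-01-02,"2,500"\n'
from_csv = pd.read_csv(io.StringIO(csv_text), thousands=",", parse_dates=["date"])
print(from_csv.dtypes)
```

Using `errors="coerce"` keeps the conversion from failing on stray text cells, at the cost of silently turning them into NaN, so it is worth counting the NaN values afterwards.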


# Covid19 in Indonesia

# Import necessary python libraries

In [1]:
#collapse
# import Python libraries
from datetime import datetime, timedelta
import os
import glob
import wget
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import json
import plotly.express as px
import plotly.graph_objs as go

# for offline plotting
from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)

# Import data

In [2]:
#collapse
# Download data from Github (daily)
os.chdir("C:/Users/Riyan Aditya/Desktop/ML_learning/Project4_EDA_Covid_Indo/datasets")

# remove previous downloads, if any (a bare os.remove raises FileNotFoundError when a file is missing)
for old_file in glob.glob('time_series_covid19_*_global.csv'):
    os.remove(old_file)

# urls of the files
urls = ['https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv', 
        'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv',
        'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv']

# download files
for url in urls:
    filename = wget.download(url)

100% [............................................................................] 264014 / 264014
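
As a possible alternative to the download step above, pandas can read a CSV straight from a URL, which avoids changing the working directory and deleting old files. The sketch below reuses the base URL and file names from the cell above; the helper names `build_url` and `load_series` are made up for this example.

```python
import pandas as pd

# Base URL and file-name pattern taken from the download cell above
BASE_URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
            "csse_covid_19_data/csse_covid_19_time_series/")

def build_url(kind):
    """Return the raw GitHub URL for 'confirmed', 'deaths' or 'recovered'."""
    return f"{BASE_URL}time_series_covid19_{kind}_global.csv"

def load_series(kind):
    """Read one time-series file directly into a DataFrame (needs internet access)."""
    return pd.read_csv(build_url(kind))

# Example usage (commented out because it downloads data):
# confirmed_global = load_series("confirmed")
```

The trade-off is that nothing is cached on disk, so every run of the notebook fetches the files again.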

# Clean & preprocess data

In [3]:
# convert csv to df
confirmed_global = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths_global = pd.read_csv('time_series_covid19_deaths_global.csv')
recovered_global = pd.read_csv('time_series_covid19_recovered_global.csv')

In [4]:
# Melt DF => turn the date columns into rows (wide to long format) for a simpler DF
dates = confirmed_global.columns[4:]

confirmed_globalv2 = confirmed_global.melt(id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long'],
                                          value_vars = dates, var_name ='Date', value_name = 'Confirmed')
deaths_globalv2 = deaths_global.melt(id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long'],
                                          value_vars = dates, var_name ='Date', value_name = 'Deaths')
recovered_globalv2 = recovered_global.melt(id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long'],
                                          value_vars = dates, var_name ='Date', value_name = 'Recovered')
print(confirmed_globalv2.shape)
print(deaths_globalv2.shape)
print(recovered_globalv2.shape)

(69958, 6)
(69958, 6)
(66539, 6)


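Why does the recovered table have fewer rows than the confirmed and deaths tables?

This suggests that some countries or provinces are missing from the recovered file, or are reported there at a different level of aggregation. Because the merge below is a left join on the confirmed table, any row without a match gets NaN in the Recovered column.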
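
One way to find out which rows differ is an outer merge with `indicator=True`, which labels each key as present in the left table only, the right table only, or both. The sketch below uses tiny made-up tables (the country and province names are placeholders), so it only illustrates the technique and says nothing about the real files.

```python
import pandas as pd

# Toy stand-ins for the confirmed and recovered tables (hypothetical values)
confirmed = pd.DataFrame({
    "Province/State": ["X1", "X2", None],
    "Country/Region": ["Xland", "Xland", "Yland"],
})
recovered = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["Xland", "Yland"],
})

keys = ["Province/State", "Country/Region"]

# Replace missing provinces with an empty string so the key comparison is explicit
left = confirmed.assign(**{"Province/State": confirmed["Province/State"].fillna("")})
right = recovered.assign(**{"Province/State": recovered["Province/State"].fillna("")})

# The _merge column says where each key combination was found
diff = left.merge(right, how="outer", on=keys, indicator=True)
only_confirmed = diff[diff["_merge"] == "left_only"]
only_recovered = diff[diff["_merge"] == "right_only"]

print(only_confirmed[keys])
print(only_recovered[keys])
```

On the real data, the same check could be run on the un-melted tables with the key columns `Province/State` and `Country/Region`.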

In [5]:
# Combine df
covid_global = confirmed_globalv2.merge(deaths_globalv2, how='left', on = 
                                        ['Province/State', 'Country/Region', 'Lat', 'Long','Date']).merge(
                                        recovered_globalv2, how='left', on =
                                        ['Province/State', 'Country/Region', 'Lat', 'Long','Date'])

In [6]:
# preprocessing
covid_global['Date'] = pd.to_datetime(covid_global['Date'])

#active cases
covid_global['Active'] = covid_global['Confirmed'] - covid_global['Deaths'] - covid_global['Recovered']
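
A toy illustration of how a missing Recovered value flows through the Active calculation: the subtraction yields NaN for that row, and a later `sum()` skips it. The numbers are made up, and filling with zero is only one possible choice, not necessarily the right one for this data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second one had no match in the recovered table
df = pd.DataFrame({
    "Confirmed": [100, 50],
    "Deaths": [5, 1],
    "Recovered": [80.0, np.nan],
})

# NaN in Recovered makes Active NaN for that row
df["Active"] = df["Confirmed"] - df["Deaths"] - df["Recovered"]

# sum() skips NaN by default, so the second row contributes nothing
active_total = df["Active"].sum()

# Alternative: treat missing recoveries as zero before subtracting
active_filled = (df["Confirmed"] - df["Deaths"] - df["Recovered"].fillna(0)).sum()

print(active_total, active_filled)
```

The NaN values also force the columns to float, which is why totals built on them print with a trailing `.0`.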

In [7]:
# Data by day
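# select the columns with a list (double brackets); tuple-style selection is not supported in recent pandas
covid_global_daily = covid_global.groupby('Date')[['Confirmed','Deaths','Recovered','Active']].sum().reset_index()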

In [8]:
# Data by country
# keep only the rows for the latest date, then aggregate provinces per country
temp = covid_global[covid_global['Date'] == covid_global['Date'].max()].reset_index(drop=True).drop('Date', axis = 1)
covid_global_percountry = temp.groupby('Country/Region')[['Confirmed','Deaths','Recovered','Active']].sum().reset_index()

# Worldwide Data Viz

## Latest data

In [9]:
# latest data
print('Date today',covid_global_daily['Date'].iloc[-1])
print('Total cases','{:,}'.format(covid_global_daily['Confirmed'].iloc[-1]))
print('Active cases','{:,}'.format(covid_global_daily['Active'].iloc[-1]))
print('Recovered cases','{:,}'.format(covid_global_daily['Recovered'].iloc[-1]))
print('Deaths cases','{:,}'.format(covid_global_daily['Deaths'].iloc[-1]))

Date today 2020-10-10 00:00:00
Total cases 37,180,308
Active cases 9,070,215.0
Recovered cases 25,689,560.0
Deaths cases 989,564.0


In [10]:
# plot: treemap of the latest active, recovered and death counts
temp = covid_global_daily[['Date','Deaths','Recovered','Active']].tail(1)
temp = temp.melt(id_vars='Date',value_vars = ['Active','Deaths','Recovered'])
fig = px.treemap(temp, path=['variable'],values = 'value', height = 225)
fig.data[0].textinfo = 'label+text+value'
fig.show()