**Goal**

Explore the effect of COVID-19 across Mainland China

In [28]:
# Import
import numpy as np
import pandas as pd

#Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
from plotly import __version__
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
#import plotly.plotly as py
import plotly.graph_objs as go
import plotly.express as px


%matplotlib inline
plt.style.use('fivethirtyeight')

In [23]:
init_notebook_mode(connected=True)

In [24]:
df = pd.read_csv('covid_19_data.csv')
df.head()

,SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,1,01/22/2020,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
1,2,01/22/2020,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0
2,3,01/22/2020,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0
3,4,01/22/2020,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
4,5,01/22/2020,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0


# Data Cleaning

We will begin by renaming the columns to a more programming-friendly form: lowercase, with underscores in place of slashes and spaces.

In [25]:
df.columns = df.columns.str.lower().str.replace('/','_').str.replace(' ', '_').str.replace('observationdate','observation_date')
df.columns

Index(['sno', 'observation_date', 'province_state', 'country_region',
       'last_update', 'confirmed', 'deaths', 'recovered'],
      dtype='object')

Columns:
* sno - Serial number
* observation_date - Date of observation in MM/DD/YYYY format
* province_state - Province or state of the observation
* country_region - Country or region of the observation
* last_update - Time in UTC at which the row was last updated
* confirmed - Cumulative number of confirmed cases up to that date
* deaths - Cumulative number of deaths up to that date
* recovered - Cumulative number of recovered cases up to that date
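
Since `observation_date` arrives as a plain MM/DD/YYYY string and the counts are whole numbers stored as floats, a type-conversion step may be useful before any time-based analysis. The snippet below is a minimal sketch on a small hypothetical frame named `sample` (its values are illustrative, not taken from the dataset), assuming the date column always follows the MM/DD/YYYY layout and the count column has no missing values.

```python
import pandas as pd

# Hypothetical rows that mimic the dataset's layout (illustrative values only)
sample = pd.DataFrame({
    'observation_date': ['01/22/2020', '01/23/2020'],
    'confirmed': [1.0, 9.0],
})

# The date layout is fixed (MM/DD/YYYY), so an explicit format avoids ambiguity
sample['observation_date'] = pd.to_datetime(
    sample['observation_date'], format='%m/%d/%Y'
)

# Counts are whole numbers; casting to int assumes there are no missing values
sample['confirmed'] = sample['confirmed'].astype(int)

print(sample.dtypes)
```

The same two calls could be applied to `df` directly if the assumptions above hold for the full file.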

This EDA focuses only on Mainland China, so we keep just those rows.

In [26]:
df = df[df['country_region']=='Mainland China']
print('Number of rows: ', df.shape[0])
df.head()

Number of rows:  1672


,sno,observation_date,province_state,country_region,last_update,confirmed,deaths,recovered
0,1,01/22/2020,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
1,2,01/22/2020,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0
2,3,01/22/2020,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0
3,4,01/22/2020,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
4,5,01/22/2020,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1672 entries, 0 to 5857
Data columns (total 8 columns):
sno                 1672 non-null int64
observation_date    1672 non-null object
province_state      1672 non-null object
country_region      1672 non-null object
last_update         1672 non-null object
confirmed           1672 non-null float64
deaths              1672 non-null float64
recovered           1672 non-null float64
dtypes: float64(3), int64(1), object(4)
memory usage: 117.6+ KB


**Number of COVID-19 deaths in China**

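The objective of this section is to find the total number of deaths in each province and compare the totals across Mainland China. Because the `deaths` column is cumulative, the maximum value per province is its total to date.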

In [7]:
# deaths is cumulative, so the per-province maximum is the total to date
total_deaths = df.groupby('province_state')['deaths'].max().sort_values(ascending=True)

In [8]:
# Horizontal bar chart: one bar per province, sorted by total deaths
fig = go.Figure(go.Bar(
    y=total_deaths.index,
    x=total_deaths.values,
    orientation='h'
))

fig.update_layout(
    title='Cumulative deaths in China',
    xaxis_title='Deaths',
    yaxis_title='Region'
)

fig.show()

**Insight**

* Approximately **95% of deaths (3085 deaths)** occurred in **Hubei**
* **Henan** has the second-highest total, with 22 deaths
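
A percentage share like the one quoted above can be derived from the per-province totals. The following is a minimal sketch on a hypothetical frame `toy` with made-up provinces `A` and `B` (the numbers are illustrative and are not the dataset's values); it assumes, as in this dataset, that `deaths` is cumulative.

```python
import pandas as pd

# Hypothetical cumulative death counts for two made-up provinces
toy = pd.DataFrame({
    'province_state': ['A', 'A', 'A', 'B', 'B', 'B'],
    'observation_date': ['01/22/2020', '01/23/2020', '01/24/2020'] * 2,
    'deaths': [10.0, 50.0, 90.0, 0.0, 5.0, 10.0],
})

# Counts are cumulative, so each province's maximum is its total to date
toy_totals = toy.groupby('province_state')['deaths'].max()

# Each province's share of all deaths, in percent
toy_share = (toy_totals / toy_totals.sum() * 100).round(1)

print(toy_share.sort_values(ascending=False))
```

Applied to this notebook's data, the equivalent expression would be `total_deaths / total_deaths.sum() * 100`.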

