# EDA: COVID-19 Analysis in China

To view Plotly Graphs: https://nbviewer.jupyter.org/github/IcedLemonTea0/EDA-COVID-19-in-China/blob/master/EDA%20COVID-19%20Analysis%20in%20China.ipynb

# Importing 

*Datasets obtained from : https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset*

In [1]:
import numpy as np
import pandas as pd
import datetime

#Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
from plotly import __version__
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot

import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots

%matplotlib inline
plt.style.use('fivethirtyeight')

In [2]:
init_notebook_mode(connected=True)

In [3]:
df = pd.read_csv('covid_19_data.csv')

df.head()

Unnamed: 0,SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,1,01/22/2020,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
1,2,01/22/2020,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0
2,3,01/22/2020,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0
3,4,01/22/2020,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
4,5,01/22/2020,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0


# Data Cleaning

We will begin by renaming the columns to lowercase snake_case so they are easier to reference in code.

In [4]:
df.columns = df.columns.str.lower().str.replace('/','_').str.replace(' ', '_').str.replace('observationdate','observation_date')
df.columns

Index(['sno', 'observation_date', 'province_state', 'country_region',
       'last_update', 'confirmed', 'deaths', 'recovered'],
      dtype='object')
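
The chained `str.replace` calls above handle `ObservationDate` with a dedicated rule. As a minimal sketch of a more general approach, the hypothetical helper `to_snake_case` below (not part of the original notebook) splits camelCase boundaries with a regular expression. The toy row is made up; only the header names mirror the dataset.

```python
import re
import pandas as pd

# Toy frame with the same raw headers as covid_19_data.csv (the row values are made up).
raw = pd.DataFrame(
    [[1, "01/22/2020", "Anhui", "Mainland China", "1/22/2020 17:00", 1.0, 0.0, 0.0]],
    columns=["SNo", "ObservationDate", "Province/State", "Country/Region",
             "Last Update", "Confirmed", "Deaths", "Recovered"],
)

def to_snake_case(name):
    """Hypothetical helper: split camelCase, replace '/' and spaces, then lowercase."""
    name = re.sub(r"([a-z])([A-Z])", r"\1_\2", name)  # ObservationDate -> Observation_Date
    name = re.sub(r"[/ ]+", "_", name)                # Province/State -> Province_State
    return name.lower()

cleaned = raw.rename(columns=to_snake_case)
print(list(cleaned.columns))
```

For these eight headers the helper produces the same names as the cell above.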

Columns: 
* sno - Serial number
* observation_date - Date of observation in MM/DD/YYYY
* province_state - Province or state of the observation
* country_region - Country of the observation
* last_update - Time in UTC at which the row was last updated
* confirmed - Cumulative number of confirmed cases up to that date
* deaths - Cumulative number of deaths up to that date
* recovered - Cumulative number of recovered cases up to that date

Since this EDA focuses only on China, we'll filter the dataframe to the rows where `country_region` is 'Mainland China'.

In [5]:
df = df[df['country_region']=='Mainland China']

print('Number of rows: ', df.shape[0])
df.head()

Number of rows:  1672


Unnamed: 0,sno,observation_date,province_state,country_region,last_update,confirmed,deaths,recovered
0,1,01/22/2020,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
1,2,01/22/2020,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0
2,3,01/22/2020,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0
3,4,01/22/2020,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
4,5,01/22/2020,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1672 entries, 0 to 5857
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   sno               1672 non-null   int64  
 1   observation_date  1672 non-null   object 
 2   province_state    1672 non-null   object 
 3   country_region    1672 non-null   object 
 4   last_update       1672 non-null   object 
 5   confirmed         1672 non-null   float64
 6   deaths            1672 non-null   float64
 7   recovered         1672 non-null   float64
dtypes: float64(3), int64(1), object(4)
memory usage: 117.6+ KB


# COVID-19 Growth 

Starting off, we'll look at cumulative growth and growth rates over time. To measure this growth, we'll use the confirmed, deaths, and recovered columns. We'll also identify the provinces that are most heavily impacted.

**Cumulative Growth in China**


We will begin our analysis by grouping the data by date, which lets us visualize cumulative trends over time.

In [7]:
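# Sum confirmed, deaths, and recovered cases across provinces for each date.
# Selecting the numeric columns before .sum() avoids aggregating the text columns.
observe_date = df.groupby(by='observation_date')[['confirmed','deaths','recovered']].sum()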
observe_date.head()

observation_date,confirmed,deaths,recovered
01/22/2020,547.0,17.0,28.0
01/23/2020,639.0,18.0,30.0
01/24/2020,916.0,26.0,36.0
01/25/2020,1399.0,42.0,39.0
01/26/2020,2062.0,56.0,49.0
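
Grouping on the raw `observation_date` strings sorts the result alphabetically. For zero-padded MM/DD/YYYY dates within a single year this matches chronological order, but it would not across a year boundary. The sketch below uses made-up rows to show the difference and assumes the dates parse with the format `%m/%d/%Y`.

```python
import pandas as pd

# Made-up rows spanning a year boundary, where string order and date order disagree.
toy = pd.DataFrame({
    "observation_date": ["01/02/2021", "12/31/2020", "01/02/2021", "12/31/2020"],
    "confirmed": [5.0, 1.0, 7.0, 2.0],
})

# Grouping on strings: groups come back in alphabetical order.
by_string = toy.groupby("observation_date")["confirmed"].sum()

# Parsing first: groups come back in chronological order.
toy["observation_date"] = pd.to_datetime(toy["observation_date"], format="%m/%d/%Y")
by_date = toy.groupby("observation_date")["confirmed"].sum()

print(by_string.index.tolist())
print(by_date.index.strftime("%Y-%m-%d").tolist())
```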


In [8]:
# convert our dates to datetime 
set_date = pd.to_datetime(observe_date.index)
unique_date = set_date.strftime('%d-%b')

def plot_cumulative_state(scatter_data,col_str,case_str):
    
    '''
        Create 2 subplots.
        Left => Total cases over time
        Right => Regions with the highest number of cases
    '''
    
    name = {'deaths': 'Deaths', 'confirmed':'Confirmed', 'recovered':'Recovered'}
    
    series = scatter_data[case_str]
    store = df.groupby(col_str).max()[case_str].sort_values(ascending =False)[0:11]

    fig = make_subplots(rows=1,cols=2,column_widths=[0.7, 0.3])
    
    fig.add_trace(go.Scatter(
        x = unique_date,
        y = series.values
    ),row = 1, col =1)
    
    fig.add_trace(go.Bar(
        x = store.sort_values(ascending= True).values,
        y = store.sort_values(ascending= True).index,
        orientation = 'h')
    ,row =1, col = 2)
    
    fig.update_xaxes(row=1, col=1,tickvals=unique_date[0::5],tickangle = 45,showgrid = False)
    fig.update_xaxes(row=1, col=2,title_text = name[case_str])
    
    fig.update_yaxes(title_text = name[case_str],row=1, col=1, showgrid= True)

    
    
    fig.update_layout(
        showlegend = False,
        title = 'Total '+ name[case_str]+ ' cases in China',
    )
    
    fig.show()
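
The bar chart relies on the per-province maximum as a stand-in for each province's latest cumulative total, which holds as long as the cumulative counts never decrease. A minimal sketch of that ranking step on made-up data follows; `top_provinces` is a hypothetical helper, not a function from this notebook.

```python
import pandas as pd

# Made-up cumulative counts for three provinces over two dates.
toy = pd.DataFrame({
    "province_state": ["A", "A", "B", "B", "C", "C"],
    "confirmed": [10.0, 30.0, 5.0, 8.0, 1.0, 2.0],
})

def top_provinces(frame, case, n=10):
    """Hypothetical helper: the n provinces with the largest peak value of `case`."""
    return frame.groupby("province_state")[case].max().nlargest(n)

top2 = top_provinces(toy, "confirmed", n=2)
print(top2)
```

Note that the slice `[0:11]` in `plot_cumulative_state` keeps eleven provinces; `nlargest(n)` makes the count explicit.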

The province of **Hubei has the most confirmed cases recorded in China**. Wuhan is the capital of Hubei, and it was reported that the first cases of COVID-19 originated there.

According to the chart, Hubei leads every other province by a wide margin. The second highest, Guangdong, has only about 2% of the cases recorded in Hubei. **China recorded 80,000+ confirmed cases in less than 30 days**, and most of those cases came from Hubei.

In [9]:
plot_cumulative_state(observe_date,'province_state','confirmed')

**Hubei leads the total recorded death count** in China with 3085 deaths. The second highest is Henan, with 22 deaths.

The chart shows that recorded deaths are almost exclusively in Hubei.

In [10]:
plot_cumulative_state(observe_date,'province_state','deaths')

Because Hubei has by far the most confirmed cases, it also leads in the number of recovered cases.

Notice that the left 'tail' of the graph takes a while to rise. It takes roughly **14 days after lockdown** before recoveries start to increase rapidly.

In [11]:
plot_cumulative_state(observe_date,'province_state','recovered')

# Understanding Growth Rates 

Analyzing growth rates gives us a better understanding of where the trends in the data are heading.

In [12]:
# Helper that expresses a case count as a percentage of confirmed cases (used for the death and recovery rates)
def create_rate(df,case):
    rate = []
    for row in range(len(df)):
        series = df.iloc[row][case]
        confirmed = df.iloc[row]['confirmed']
        rate.append(series/confirmed*100)
    return rate

observe_date['death_rate'] = create_rate(observe_date,'deaths')
observe_date['recovery_rate'] = create_rate(observe_date,'recovered')

def growth_rate(df,case):
    rate = []
    for i in range(len(df)-1):
        x1 = df.iloc[i][case]
        x2 = df.iloc[i+1][case]
        pct_delta_x = (x2-x1)/x1 * 100
        rate.append(pct_delta_x)
    return rate


observe_date.head()

observation_date,confirmed,deaths,recovered,death_rate,recovery_rate
01/22/2020,547.0,17.0,28.0,3.107861,5.11883
01/23/2020,639.0,18.0,30.0,2.816901,4.694836
01/24/2020,916.0,26.0,36.0,2.838428,3.930131
01/25/2020,1399.0,42.0,39.0,3.002144,2.787706
01/26/2020,2062.0,56.0,49.0,2.71581,2.376334


There are two measures of change we will focus on:
* Rate: the cumulative count of a case type as a percentage of cumulative confirmed cases (`create_rate`)
* Growth factor: the day-over-day percentage change in a cumulative count (`growth_rate`)
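
Both measures can also be computed without explicit loops. The sketch below is an alternative, not the notebook's own implementation: it recomputes them for the first three dates of `observe_date` shown above, using column arithmetic for the rate and `pct_change` for the day-over-day change.

```python
import pandas as pd

# First three rows of observe_date, copied from the output above.
sample = pd.DataFrame(
    {"confirmed": [547.0, 639.0, 916.0], "deaths": [17.0, 18.0, 26.0]},
    index=["01/22/2020", "01/23/2020", "01/24/2020"],
)

# Rate: deaths as a percentage of confirmed cases on the same date.
death_rate = sample["deaths"] / sample["confirmed"] * 100

# Growth factor: percentage change from the previous date.
# The first date has no previous value, so it is NaN.
confirmed_growth = sample["confirmed"].pct_change() * 100

print(death_rate.round(6).tolist())
print(confirmed_growth.round(2).tolist())
```

Unlike `growth_rate`, which returns one element fewer than the input, `pct_change` keeps one value per date, so the result stays aligned with the date index.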

In [13]:
def create_growth_plots(df, case):
    name = {'deaths': 'Death', 'confirmed':'Confirmed', 'recovered':'Recovery'}
    
    fig = make_subplots(rows=1,cols=2, shared_yaxes=False)

    fig.add_trace(go.Scatter(
        x = unique_date,
        y = create_rate(df,case),
        name = 'Daily '+ name[case] + ' Rate'
    ), row=1, col =1)

    fig.add_trace(go.Scatter(
        x = unique_date,
        y = [np.mean(create_rate(df,case))]*len(df),
        name = 'Average '+ name[case]+' Rate',
        line = dict(dash='dot')

    ), row = 1, col =1)

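    # growth_rate() returns one value per pair of consecutive dates, so each value
    # belongs to the later date; skip the first date to keep x and y aligned.
    fig.add_trace(go.Scatter(
        x = unique_date[1:],
        y = growth_rate(df,case),
        name = 'Daily '+name[case]+' Growth Factor'
    ), row = 1, col =2)

    fig.add_trace(go.Scatter(
        x = unique_date[1:],
        y = [np.mean(growth_rate(df,case))]*(len(df)-1),
        name = 'Daily Average '+name[case]+' <br> Growth Factor',
        line = dict(dash='dot')
    ), row=1, col =2)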
    
    fig.update_xaxes(row=1,col=1, tickvals=unique_date[0::5],tickangle = 45,showgrid=False)
    fig.update_xaxes(row=1,col=2, tickvals=unique_date[0::5],tickangle = 45,showgrid=False)
    
    fig.update_yaxes(title_text = 'Daily '+ name[case] + ' Rate (%)',row=1, col=1, showgrid= False)
    fig.update_yaxes(title_text = name[case]+' Growth Factor (%)',row=1, col=2, showgrid= False)
    

    fig.update_layout(
        showlegend=False,
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)',
        title = 'Overview of daily '+name[case] +' rate vs its growth factor',
        width=1000

        )
    
    fig.show()

Since the rate divides each count by the number of positive (confirmed) cases, the rate for positive cases is confirmed divided by confirmed, so it stays constant at 100%.

The growth factor for positive cases spikes 5 days after lockdown. The trend then heads downward in early February, shows a minor spike in mid-February, and continues downward after that.

At the latest observed date, the growth factor for positive cases is less than 1%. This could indicate that daily confirmed cases are slowing down.

In [14]:
create_growth_plots(observe_date,'confirmed')

The daily death rate grew rapidly in mid-February and now seems to be slowing down. Noticeably, the death rate decreased at the beginning of the lockdown, but part of that could be due to the increase in positive cases (shown above as the spike in the confirmed growth factor): more positive cases skewed the daily death rate lower.

The overall death growth factor is decreasing, with multiple severe spikes at the beginning of lockdown. The trend has stayed below its average for nearly a month. In the last 7 days, the death growth factor has been below 1% and is still heading down. This is good news.

In [15]:
create_growth_plots(observe_date,'deaths')

The daily recovery rate is increasing rapidly, and the graph suggests that it will continue to increase.

Looking at the recovery growth factor, there are significant spikes that jump to between 60% and 74% within a short period of time. The trend then continues downward but maintains an average growth factor of approximately 17%, much higher than the others.

In [16]:
create_growth_plots(observe_date,'recovered')

# New COVID-19 Cases

In this section, we'll analyze new cases against cumulative cases. As we have seen above, the cumulative graphs show rapid growth but reveal little about whether the spread is slowing. To address that, we plot both axes on a logarithmic scale to better show the relationship between new cases and cumulative cases.

In [17]:
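def new_cases(values):
    # Day-over-day differences of a cumulative series (one element shorter than the input).
    # Converting to an array first avoids shadowing the built-in `list` and
    # avoids label-based lookups when a date-indexed Series is passed in.
    return np.diff(np.asarray(values, dtype=float)).tolist()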

def plot_comp_new_case(df,case):
    name = {'deaths': 'Deaths', 'confirmed':'Positive', 'recovered':'Recovered'}
    fig = px.scatter(x=df[case][1:len(df)],y=new_cases(df[case]),trendline ='lowess')

    fig.update_layout(
        xaxis_type="log",
        yaxis_type="log",
        title = 'New Reported '+name[case]+' Cases & Total Number of '+name[case]+' Cases',
        xaxis_title = 'Total Number of Reported '+name[case]+' Cases',
        yaxis_title = 'New Reported '+name[case]+' Cases')
    fig.show()
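
The reasoning behind the log-log plot: while growth is exponential, new cases are a roughly fixed fraction of cumulative cases, so the points fall along a straight line; once growth stalls, new cases drop away from that line. The sketch below illustrates this on a synthetic series (not real data). It also filters out non-positive values, since those have no position on a logarithmic axis.

```python
import pandas as pd

# Synthetic cumulative counts: doubling every day, then flat.
cumulative = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0, 16.0, 16.0])

# New cases are the day-over-day difference; the first day has no previous value.
new = cumulative.diff()

paired = pd.DataFrame({"cumulative": cumulative, "new": new}).dropna()

# Keep only strictly positive new-case counts so every point can be drawn on a log axis.
plottable = paired[paired["new"] > 0]

# During the doubling phase, new cases are a constant share of the cumulative total.
ratio = plottable["new"] / plottable["cumulative"]
print(ratio.tolist())
```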

In the graph below, new reported positive cases drop off significantly.

In [18]:
plot_comp_new_case(observe_date,'confirmed')

As the graph below shows, new reported deaths are also dropping.

In [19]:
plot_comp_new_case(observe_date,'deaths')

Unlike the other two, new recoveries show a drastic increase.

In [20]:
plot_comp_new_case(observe_date,'recovered')

# Hubei

In our final analysis, we'll look at the province of Hubei.

Hubei leads the number of deaths in China. 
How do Hubei's cases compare with China as a whole?

In [21]:
hubei = df[df['province_state']=='Hubei']
hubei.head()

Unnamed: 0,sno,observation_date,province_state,country_region,last_update,confirmed,deaths,recovered
13,14,01/22/2020,Hubei,Mainland China,1/22/2020 17:00,444.0,17.0,28.0
51,52,01/23/2020,Hubei,Mainland China,1/23/20 17:00,444.0,17.0,28.0
84,85,01/24/2020,Hubei,Mainland China,1/24/20 17:00,549.0,24.0,31.0
125,126,01/25/2020,Hubei,Mainland China,1/25/20 17:00,761.0,40.0,32.0
169,170,01/26/2020,Hubei,Mainland China,1/26/20 16:00,1058.0,52.0,42.0


In [22]:
fig = go.Figure()

fig.add_trace(go.Bar(
    x = hubei['observation_date'],
    y = hubei['deaths']
))

fig.layout.update(
    title='Number of Recorded Deaths in Hubei',
    xaxis_title = 'Recorded Dates',
    yaxis_title = 'Deaths'
)

fig.update_xaxes(tickangle=75) 

fig.show()

In [24]:
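# Average the numeric columns only; the text columns cannot be averaged.
observe_date_mean = df.groupby('observation_date')[['confirmed','deaths','recovered']].mean()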

def plot_hubei(case):
    
    name = {'deaths': 'Deaths', 'confirmed':'Positive', 'recovered':'Recovery'}
    rate = {'deaths': 'death_rate','recovered':'recovery_rate' }

    fig = go.Figure()

    # Hubei's rate on each date
    fig.add_trace(go.Scatter(
        x = hubei['observation_date'],
        y = create_rate(hubei,case),
        name = 'Hubei\'s '+name[case]+' Rate',
    ))

    # Rate computed from the per-date provincial means
    fig.add_trace(go.Scatter(
        x = hubei['observation_date'],
        y = create_rate(observe_date_mean,case),
        name = 'China\'s Average '+name[case]+' Rate',
        line = dict(dash='dot')

    ))

    # Horizontal reference line at China's most recent rate
    fig.add_trace(go.Scatter(
        x = hubei['observation_date'],
        y = [observe_date[rate[case]].iloc[-1]] * len(observe_date),
        name = 'China\'s latest '+name[case]+' rate',
        line = dict(dash='dot')
    ))


    fig.layout.update(
        title = 'Comparing Hubei\'s '+name[case]+' Rate',
        xaxis_title = 'Observed Dates',
        yaxis_title = name[case]+' Rate (%)'
    )

    fig.update_xaxes(tickangle=45) 

    fig.show()
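
One detail worth keeping in mind when reading these plots: `observe_date_mean` holds per-date means of the counts, and `create_rate` divides mean deaths (or recoveries) by mean confirmed cases. The number of provinces cancels in that division, so the "China's Average" curve equals the national rate computed from sums, not the average of the provincial rates. The sketch below checks this on made-up counts for a single date, assuming no missing values.

```python
import numpy as np
import pandas as pd

# Made-up counts for three provinces on one date (no missing values).
one_day = pd.DataFrame({
    "confirmed": [1000.0, 50.0, 10.0],
    "deaths": [40.0, 1.0, 0.0],
})

# What create_rate(observe_date_mean, 'deaths') computes for this date.
rate_from_means = one_day["deaths"].mean() / one_day["confirmed"].mean() * 100

# The national rate from summed counts.
rate_from_sums = one_day["deaths"].sum() / one_day["confirmed"].sum() * 100

# The average of the individual provincial rates, which is a different quantity.
mean_of_rates = (one_day["deaths"] / one_day["confirmed"] * 100).mean()

print(round(rate_from_means, 4), round(rate_from_sums, 4), round(mean_of_rates, 4))
```

Because large provinces dominate the sums, this curve is expected to track the largest province closely.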

Hubei's recovery rate closely follows China's average recovery rate. The graph suggests that Hubei has a large influence on the overall figures for China.

In [25]:
plot_hubei('recovered')

As shown below, Hubei's death rate almost mirrors China's average death rate.

In [26]:
plot_hubei('deaths')

**Correlation**

The figures above show that China's average rates and Hubei's rates are very similar. 
This is likely because Hubei accounts for more than 90% of the recorded counts behind China's rates.
To check this relationship, let's plot China's average rates against Hubei's rates and see how closely a fitted line matches them.

In [27]:
# Pair China's average rate with Hubei's rate for each date
recovery_rates = {'Average Recovery Rate': create_rate(observe_date_mean,'recovered'),'Hubei Recovery Rate': create_rate(hubei,'recovered')} 
death_rates = {'Average Death Rate':create_rate(observe_date_mean,'deaths'),'Hubei Death Rate':create_rate(hubei,'deaths')}

recovery_df = pd.DataFrame(recovery_rates)
death_df = pd.DataFrame(death_rates)

# Scatter plots with an ordinary least squares trendline
fig = px.scatter(recovery_df,x = 'Average Recovery Rate', y='Hubei Recovery Rate',trendline ='ols')

fig1 = px.scatter(death_df,x = 'Average Death Rate', y='Hubei Death Rate',trendline ='ols')

fig.layout.update(
    title='Average Recovery Rate vs. Hubei Recovery Rate Over Time'
)

fig1.layout.update(
    title = 'Average Death Rate vs. Hubei Death Rate Over Time'
)

fig.show()
fig1.show()
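
A single number can complement the scatter plots. As a minimal sketch, the Pearson correlation coefficient between two rate columns can be computed with `Series.corr`. The values below are made up for illustration; in the notebook the same call could be applied to the two columns of each rate dataframe.

```python
import pandas as pd

# Made-up rate series; the second column is exactly 1.2 times the first.
rates = pd.DataFrame({
    "Average Death Rate": [2.0, 2.5, 3.0, 3.5],
    "Hubei Death Rate": [2.4, 3.0, 3.6, 4.2],
})

# Pearson correlation (the default method of Series.corr).
r = rates["Average Death Rate"].corr(rates["Hubei Death Rate"])
print(round(r, 4))
```

A value close to 1 means the two series move together almost linearly, which is what a tight fit around the OLS trendline would suggest.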

**Insight:**

# UPDATE

In [28]:
fig = px.scatter(x=hubei['confirmed'][1:len(observe_date)],y=new_cases(hubei['confirmed'].values),trendline ='lowess')

fig.update_layout(
    xaxis_type="log",
    yaxis_type="log",
    title = 'New Reported Positive Cases & Total Number of Positive Cases',
    xaxis_title = 'Total Number of Reported Positive Cases',
    yaxis_title = 'New Reported Positive Cases')
