<a href="https://colab.research.google.com/github/Sinamhd9/Covid-19-EDA/blob/main/covid_19_eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Description

The dataset used in this project comes from the Kaggle competition COVID19 Global Forecasting (Week 5).

In this notebook, we visualize the daily number of confirmed COVID19 cases and fatalities in various locations across the world. Pandas and Plotly are the main tools used for visualization and exploratory data analysis (EDA).

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import plotly.express as px
import plotly
plotly.__version__

'4.11.0'

In [None]:
df_train = pd.read_csv('../input/covid19-global-forecasting-week-5/train.csv')
df_test = pd.read_csv('../input/covid19-global-forecasting-week-5/test.csv')
display(df_train.head(2))
display(df_train.tail(2))
print('Train shape:', df_train.shape)

display(df_test.head(2))
display(df_test.tail(2))
print('Test shape', df_test.shape)

Unnamed: 0,Id,County,Province_State,Country_Region,Population,Weight,Date,Target,TargetValue
0,1,,,Afghanistan,27657145,0.058359,2020-01-23,ConfirmedCases,0
1,2,,,Afghanistan,27657145,0.583587,2020-01-23,Fatalities,0


Unnamed: 0,Id,County,Province_State,Country_Region,Population,Weight,Date,Target,TargetValue
969638,969639,,,Zimbabwe,14240168,0.060711,2020-06-10,ConfirmedCases,6
969639,969640,,,Zimbabwe,14240168,0.607106,2020-06-10,Fatalities,0


Train shape: (969640, 9)


Unnamed: 0,ForecastId,County,Province_State,Country_Region,Population,Weight,Date,Target
0,1,,,Afghanistan,27657145,0.058359,2020-04-27,ConfirmedCases
1,2,,,Afghanistan,27657145,0.583587,2020-04-27,Fatalities


Unnamed: 0,ForecastId,County,Province_State,Country_Region,Population,Weight,Date,Target
311668,311669,,,Zimbabwe,14240168,0.060711,2020-06-10,ConfirmedCases
311669,311670,,,Zimbabwe,14240168,0.607106,2020-06-10,Fatalities


Test shape (311670, 8)


## Data types and Missing values

In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 969640 entries, 0 to 969639
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Id              969640 non-null  int64  
 1   County          880040 non-null  object 
 2   Province_State  917280 non-null  object 
 3   Country_Region  969640 non-null  object 
 4   Population      969640 non-null  int64  
 5   Weight          969640 non-null  float64
 6   Date            969640 non-null  object 
 7   Target          969640 non-null  object 
 8   TargetValue     969640 non-null  int64  
dtypes: float64(1), int64(3), object(5)
memory usage: 66.6+ MB


We can see that County and Province_State contain missing values, since this information is only available for some countries, such as the US.

We convert the Date column to a Pandas datetime type.
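
As a complement to `info()`, the missing values can also be counted directly. The snippet below is a minimal sketch on a toy frame that mimics the layout of `train.csv`; the rows are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy frame with the same location columns as train.csv (illustrative values).
toy = pd.DataFrame({
    "County": [None, None, "Autauga"],
    "Province_State": [None, "Alabama", "Alabama"],
    "Country_Region": ["Afghanistan", "US", "US"],
})

# Number of missing entries per column, and the share of rows they represent.
missing = toy.isna().sum()
missing_share = toy.isna().mean()
print(missing)
print(missing_share.round(2))
```

Running the same two calls on `df_train` should give the per-column counts implied by the `Non-Null Count` column above.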

In [None]:
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_test['Date'] = pd.to_datetime(df_test['Date'])

print(df_train['Date'].dtypes)

datetime64[ns]
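
Once the column is a datetime type, date arithmetic becomes available. The sketch below uses the first and last dates shown in the `head` and `tail` outputs above; the explicit `format` argument is an optional assumption about the file's date layout.

```python
import pandas as pd

# First and last dates seen in the train set outputs above.
dates = pd.Series(["2020-01-23", "2020-06-10"])

# Parse the strings; passing the format explicitly avoids any guessing.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# Subtracting two timestamps yields a Timedelta.
span_days = (parsed.max() - parsed.min()).days
print(span_days)
```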


# Exploratory Data Analysis

Pie charts, animated bar plots and animated choropleth maps are utilized to visualize the data. 

We create two dataframes, one for confirmed cases (cc) and one for fatalities (fat). Note that US cases are reported multiple times, both with and without State and County information. Summing over 'Country_Region' without filtering would therefore overcount US cases, so we keep only the rows where both Province_State and County are missing.

In [None]:
df_cc = df_train[(df_train['Target'] == 'ConfirmedCases') & (pd.isnull(df_train['Province_State'])) & (
    pd.isnull(df_train['County']))]
df_fat = df_train[(df_train['Target'] == 'Fatalities') & (pd.isnull(df_train['Province_State'])) & (
    pd.isnull(df_train['County']))]
display(df_cc.head(2))
print(df_cc.shape)
display(df_fat.head(2))
print(df_fat.shape)

Unnamed: 0,Id,County,Province_State,Country_Region,Population,Weight,Date,Target,TargetValue
0,1,,,Afghanistan,27657145,0.058359,2020-01-23,ConfirmedCases,0
2,3,,,Afghanistan,27657145,0.058359,2020-01-24,ConfirmedCases,0


(26180, 9)


Unnamed: 0,Id,County,Province_State,Country_Region,Population,Weight,Date,Target,TargetValue
1,2,,,Afghanistan,27657145,0.583587,2020-01-23,Fatalities,0
3,4,,,Afghanistan,27657145,0.583587,2020-01-24,Fatalities,0


(26180, 9)
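
The overcounting issue can be illustrated on a toy frame. This is a sketch with made-up numbers, assuming the same value is reported at country, state, and county level as described above.

```python
import pandas as pd

# One day of hypothetical US data: a country row, a state row, and two county rows.
toy = pd.DataFrame({
    "County": [None, None, "Autauga", "Baldwin"],
    "Province_State": [None, "Alabama", "Alabama", "Alabama"],
    "Country_Region": ["US"] * 4,
    "TargetValue": [10, 10, 6, 4],
})

# Naive sum over the country counts the same cases three times.
naive_total = toy["TargetValue"].sum()

# Keeping only country-level rows (no state, no county) gives the true total.
country_level = toy[toy["Province_State"].isna() & toy["County"].isna()]
country_total = country_level["TargetValue"].sum()

print(naive_total, country_total)
```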


In [None]:
fig1 = px.pie(df_cc, values='TargetValue', color_discrete_sequence=px.colors.qualitative.Light24,template="plotly_dark", names='Country_Region', title='Confirmed cases of all countries')
fig1.update_traces(textposition='inside')
fig1.update_layout(font_size=16)
fig1.show()

fig2 = px.pie(df_fat, values='TargetValue', color_discrete_sequence=px.colors.qualitative.Light24,template="plotly_dark", names='Country_Region', title='Fatalities of all countries')
fig2.update_traces(textposition='inside')
fig2.update_layout(font_size=16)
fig2.show()


We make use of Pandas data manipulation methods to visualize the trends over time for the top countries. 

In [None]:
df_cc_gr = df_cc.copy()
df_cc_gr = df_cc_gr.groupby(['Country_Region'], as_index=False)['TargetValue'].sum()
df_cc_gr = df_cc_gr.nlargest(12, 'TargetValue')

df_cc2 = df_cc.copy()
df_cc2 = df_cc2.loc[df_cc2['Country_Region'].isin(df_cc_gr.Country_Region)]
df_cc2['cum_target'] = df_cc2.groupby(['Country_Region'])['TargetValue'].cumsum()

df_fat_gr = df_fat.copy()
df_fat_gr = df_fat_gr.groupby(['Country_Region'], as_index=False)['TargetValue'].sum()
df_fat_gr = df_fat_gr.nlargest(12, 'TargetValue')

df_fat2 = df_fat.copy()
df_fat2 = df_fat2.loc[df_fat2['Country_Region'].isin(df_fat_gr.Country_Region)]
df_fat2['cum_target'] = df_fat2.groupby(['Country_Region'])['TargetValue'].cumsum()
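
The pattern used in the cell above (total per country, keep the largest, then a running sum per country) can be sketched on a small example. The country names and values are hypothetical, and the explicit sort by date is an added precaution: the notebook relies on the file already being in date order.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country_Region": ["A", "A", "B", "B", "C", "C"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02"] * 3),
    "TargetValue": [1, 2, 10, 20, 5, 5],
})

# Total per country, then keep the two countries with the largest totals.
totals = toy.groupby("Country_Region", as_index=False)["TargetValue"].sum()
top = totals.nlargest(2, "TargetValue")

# Restrict the daily rows to those countries and add a per-country running sum.
top_rows = toy[toy["Country_Region"].isin(top["Country_Region"])].copy()
top_rows = top_rows.sort_values(["Country_Region", "Date"])
top_rows["cum_target"] = top_rows.groupby("Country_Region")["TargetValue"].cumsum()

print(top_rows)
```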


The following plots show confirmed cases per day and cumulative confirmed cases over time for the 12 countries with the most cases.

In [None]:
fig3 = px.line(df_cc2, x='Date', y='TargetValue', color_discrete_sequence=px.colors.qualitative.Light24, 
               color='Country_Region',  title='Confirmed cases per day for 12 countries with the most cases over time', template="plotly_dark", labels={'TargetValue': 'Confirmed Cases'})
fig3.update_layout(font_size=16)
fig3.show()


fig4 = px.line(df_cc2, x='Date', y='cum_target', color='Country_Region',
               color_discrete_sequence=px.colors.qualitative.Light24,  title='Cumulative Confirmed cases for 12 countries with the most cases over time', template="plotly_dark", labels={'cum_target': 'Confirmed Cases'})
fig4.update_layout(font_size=16)
fig4.show()



And here we visualize the trends for the fatalities.

In [None]:
fig5 = px.line(df_fat2, x='Date', y='TargetValue', color_discrete_sequence=px.colors.qualitative.Light24,
               color='Country_Region',  title='Fatalities per day for 12 countries with the most fatalities over time', template="plotly_dark", labels={'TargetValue': 'Fatalities'})
fig5.update_layout(font_size=16)
fig5.show()



fig6 = px.line(df_fat2, x='Date', y='cum_target',  color_discrete_sequence=px.colors.qualitative.Light24,
               color='Country_Region',  title='Cumulative Fatalities for 12 countries with the most fatalities over time', template="plotly_dark", labels={'cum_target': 'Fatalities'})
fig6.update_layout(font_size=16)
fig6.show()


Using Plotly animation, we can see the countries race in the number of confirmed cases and fatalities over time.

In [None]:
df_anim_cc = df_cc2[df_cc2["Date"].dt.strftime('%m').astype(int)>=3]
fig7 = px.bar(df_anim_cc, y="Country_Region", x='cum_target', orientation='h', color="Country_Region", labels={'cum_target': 'Confirmed Cases'},
                    hover_name="Country_Region", animation_frame=df_anim_cc["Date"].dt.strftime('%m-%d'),
                    title='Confirmed cases over time', range_x=[0, df_cc2['cum_target'].max()],
                   color_discrete_sequence=px.colors.qualitative.Light24, template="plotly_dark")
fig7.update_layout(font_size=16, yaxis={'categoryorder':"total ascending"})
fig7.show()

df_anim_fat = df_fat2[df_fat2["Date"].dt.strftime('%m').astype(int)>=3]

fig8 = px.bar(df_anim_fat, y="Country_Region", x='cum_target', orientation='h', color="Country_Region", labels={'cum_target': 'Fatalities'},
            animation_frame=df_anim_fat["Date"].dt.strftime('%m-%d'),
              title='Fatalities over time',range_x=[0, df_fat2['cum_target'].max()],
            color_discrete_sequence=px.colors.qualitative.Light24, template="plotly_dark")
fig8.update_layout(font_size=16, yaxis={'categoryorder':"total ascending"})
fig8.show()
fig8.write_html("fig8.html")
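
The animation cells above restrict the data to March onwards and use a month-day string as the frame label. A possibly simpler way to express the same filter is the `dt.month` accessor, shown here as a sketch on toy data; the `frame` column name is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-02-28", "2020-03-01", "2020-03-02"]),
    "cum_target": [1, 3, 6],
})

# Keep rows from March onwards using the integer month directly.
from_march = toy[toy["Date"].dt.month >= 3].copy()

# One label per animation frame, in the same '%m-%d' format used above.
from_march["frame"] = from_march["Date"].dt.strftime("%m-%d")

print(from_march)
```

A column such as `frame` could then be passed by name to `animation_frame` instead of passing the Series itself.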


And here, we can see the countries race in the number of confirmed cases and fatalities over time on the world map.

In [None]:

df_cc3 = df_cc.copy()
df_cc3['cum_target'] = df_cc3.groupby(['Country_Region'])['TargetValue'].cumsum()

df_fat3 = df_fat.copy()
df_fat3['cum_target'] = df_fat3.groupby(['Country_Region'])['TargetValue'].cumsum()

In [None]:

fig9 = px.choropleth(df_cc3, locations="Country_Region", locationmode='country names', color=np.log(df_cc3['cum_target']),
                    labels={'color': 'Confirmed Cases (log)'}, hover_name="Country_Region",
                     animation_frame=df_cc3["Date"].dt.strftime('%m-%d'),
                    title='Confirmed cases over time', template="plotly_dark", color_continuous_scale='Reds')
# fig.update(layout_coloraxis_showscale=False)
fig9.update_layout(font_size=16)
fig9.show()

fig10 = px.choropleth(df_fat3, locations="Country_Region", locationmode='country names', color=np.log(df_fat3['cum_target']),
                    labels={'color': 'Fatalities (log)'}, hover_name="Country_Region", animation_frame=df_fat3["Date"].dt.strftime('%m-%d'),
                    title='Fatalities over time', template="plotly_dark", color_continuous_scale='Reds')
# fig.update(layout_coloraxis_showscale=False)
fig10.update_layout(font_size=16)
fig10.show()
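
One caveat with the log color scale above: `np.log` returns `-inf` for a cumulative count of zero, which is common in the early dates. The sketch below compares it with `np.log1p`, which computes log(1 + x); using `log1p` here is a suggestion, not what the cells above do.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative counts, starting at zero.
cum = pd.Series([0, 1, 9, 99])

# Plain log: the zero entry becomes -inf (the divide warning is silenced here).
with np.errstate(divide="ignore"):
    plain_log = np.log(cum)

# log1p maps zero to zero and stays finite for all non-negative counts.
safe_log = np.log1p(cum)

print(plain_log.tolist())
print(safe_log.round(3).tolist())
```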

In the following, the confirmed cases and fatalities of US states are shown. 


In [None]:
df_cc_usa = df_train[(df_train['Target'] == 'ConfirmedCases') & (df_train['Country_Region'] == 'US') & (
pd.notnull(df_train['Province_State'])) & (
pd.isnull(df_train['County']))]
df_fat_usa = df_train[(df_train['Target'] == 'Fatalities') & (df_train['Country_Region'] == 'US') & (
pd.notnull(df_train['Province_State']))& (
pd.isnull(df_train['County']))]
fig11 = px.pie(df_cc_usa, values='TargetValue', names='Province_State',
             title='Confirmed cases of USA',template="plotly_dark")
fig11.update_traces(textposition='inside', textinfo='percent+label')
fig11.show()
fig12 = px.pie(df_fat_usa, values='TargetValue', names='Province_State',
             title='Fatalities of USA',template="plotly_dark")
fig12.update_traces(textposition='inside', textinfo='percent+label')
fig12.show()

The treemaps below show the confirmed cases and fatalities of US states in relation to their population.

In [None]:
fig13 = px.treemap(df_cc_usa, path=['Province_State'], values='TargetValue',
                  color='Population', title='Confirmed cases of USA', hover_data=['Province_State'], template="plotly_dark",color_continuous_scale='Reds')
fig13.show()
fig14 = px.treemap(df_fat_usa, path=['Province_State'], values='TargetValue',
                  color='Population', title='Fatalities of USA', hover_data=['Province_State'], template="plotly_dark",color_continuous_scale='Reds')
fig14.show()

1. California has the darkest color, which means it has the highest population, even though its rectangle is much smaller than New York's.
2. This plot lets us compare how states performed in controlling the virus. For example, Massachusetts has one of the highest fatality counts, even though its population is lower than that of many other states.
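
The comparison against population can also be made explicit by computing a rate per 100,000 inhabitants. This is a minimal sketch with hypothetical states and numbers; the `per_100k` column name is an assumption and does not appear in the dataset.

```python
import pandas as pd

# Hypothetical totals per state (not real data).
toy = pd.DataFrame({
    "Province_State": ["State A", "State B"],
    "Population": [1_000_000, 4_000_000],
    "TargetValue": [500, 1_000],
})

# Counts per 100,000 inhabitants make states of different sizes comparable.
toy["per_100k"] = toy["TargetValue"] * 100_000 / toy["Population"]

print(toy.sort_values("per_100k", ascending=False))
```

Here State B has more cases in absolute terms, but State A has the higher rate relative to its population.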