In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/covid19-timeline-analysis/owid-covid-data.csv


In [2]:
!pip install plotly



In [3]:
import plotly.graph_objs as go
import plotly.express as px

# Before I Start

This time I will be working with ***plotly*** instead of ***seaborn***.

## Assumptions

- There should be two waves of Covid outbreaks: one around April 2020 and a second at the end of 2020 or the beginning of 2021.
- Testing numbers should be increasing because of the introduction of the green pass and the restrictions on people who do not have one.
- Vaccination should ramp up at the start and then slow down around summer because of the holiday season.
- Countries with a higher GDP per capita and a younger median age should suffer less from Covid.

## Questions I Want to Answer

0. What is the total number of Covid cases, tests made and vaccinated people?
1. How dense is the population in every country?
2. What is the median age in every country and how does it compare to others?
3. How did Covid progress by total cases, new cases, total deaths and new deaths?
4. How did the testing rate change during the pandemic?
5. How did tests per case change in Lithuania, Japan and the USA?
6. How did vaccination progress change?
7. How strictly did governments respond with restrictions?
8. Do countries with more smokers suffer more from Covid?
9. How do gdp_per_capita and median age correlate with total cases?


# Basic Insights on Data

## Loading Data

In [4]:
df = pd.read_csv('../input/covid19-timeline-analysis/owid-covid-data.csv')
df.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality
0,AFG,Asia,Afghanistan,2020-02-24,1.0,1.0,,,,,...,,597.029,9.59,,,37.746,0.5,64.83,0.511,
1,AFG,Asia,Afghanistan,2020-02-25,1.0,0.0,,,,,...,,597.029,9.59,,,37.746,0.5,64.83,0.511,
2,AFG,Asia,Afghanistan,2020-02-26,1.0,0.0,,,,,...,,597.029,9.59,,,37.746,0.5,64.83,0.511,
3,AFG,Asia,Afghanistan,2020-02-27,1.0,0.0,,,,,...,,597.029,9.59,,,37.746,0.5,64.83,0.511,
4,AFG,Asia,Afghanistan,2020-02-28,1.0,0.0,,,,,...,,597.029,9.59,,,37.746,0.5,64.83,0.511,


## Basic information about the structure of the dataset

In [5]:
df.shape

(103143, 60)

In [6]:
len(df.iso_code.unique())

231

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103143 entries, 0 to 103142
Data columns (total 60 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   iso_code                               103143 non-null  object 
 1   continent                              98330 non-null   object 
 2   location                               103143 non-null  object 
 3   date                                   103143 non-null  object 
 4   total_cases                            99197 non-null   float64
 5   new_cases                              99194 non-null   float64
 6   new_cases_smoothed                     98184 non-null   float64
 7   total_deaths                           88953 non-null   float64
 8   new_deaths                             89109 non-null   float64
 9   new_deaths_smoothed                    98184 non-null   float64
 10  total_cases_per_million                98670 non-null   

In [8]:
df.isna().sum()

iso_code                                      0
continent                                  4813
location                                      0
date                                          0
total_cases                                3946
new_cases                                  3949
new_cases_smoothed                         4959
total_deaths                              14190
new_deaths                                14034
new_deaths_smoothed                        4959
total_cases_per_million                    4473
new_cases_per_million                      4476
new_cases_smoothed_per_million             5481
total_deaths_per_million                  14704
new_deaths_per_million                    14548
new_deaths_smoothed_per_million            5481
reproduction_rate                         20176
icu_patients                              92448
icu_patients_per_million                  92448
hosp_patients                             90222
hosp_patients_per_million               

In [9]:
df.describe()

Unnamed: 0,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,...,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality
count,99197.0,99194.0,98184.0,88953.0,89109.0,98184.0,98670.0,98667.0,97662.0,88439.0,...,62295.0,92382.0,94822.0,72229.0,71176.0,46418.0,84121.0,97916.0,92508.0,3624.0
mean,1124722.0,6072.262425,6083.795353,30077.28,145.766802,131.539926,13882.078745,76.578968,76.711418,304.10861,...,13.422692,258.627553,7.939718,10.57467,32.702259,50.81425,3.026876,73.239068,0.727192,18.218438
std,7666606.0,37758.463944,37435.972295,179541.4,797.217897,741.775158,25189.980677,200.620817,158.654683,544.503228,...,19.96635,119.095459,4.165228,10.476721,13.488423,31.757912,2.457652,7.55799,0.150353,35.742831
min,1.0,-74347.0,-6223.0,1.0,-1918.0,-232.143,0.001,-2153.437,-276.825,0.001,...,0.1,79.37,0.99,0.1,7.7,1.188,0.1,53.28,0.394,-95.59
25%,1374.0,2.0,7.714,57.0,0.0,0.0,272.535,0.227,1.316,8.347,...,0.6,167.295,5.31,1.9,21.6,19.351,1.3,67.92,0.602,0.4475
50%,14419.0,75.0,93.714,423.0,2.0,1.429,1932.3695,8.738,11.496,54.754,...,2.2,242.648,7.11,6.3,31.4,49.839,2.4,74.62,0.748,7.44
75%,154445.0,829.0,870.0,4082.0,18.0,14.571,14805.33525,70.8065,78.96875,335.1945,...,21.2,329.635,10.08,19.3,41.1,83.241,3.861,78.74,0.848,24.0975
max,189997800.0,905993.0,826368.143,4082335.0,18060.0,14735.857,184727.885,18293.675,4083.5,5915.562,...,77.6,724.417,30.53,44.0,78.1,100.0,13.8,86.75,0.957,409.9


In [10]:
df.date.min(), df.date.max()

('2020-01-01', '2021-07-17')

### Observations

We have 60 features and 103143 entries covering 231 locations (unique ISO codes). Many columns have a lot of missing data and will need further attention. There are also some negative values in the columns new_cases, new_deaths and new_cases_per_million; I'll have to keep that in mind and investigate the cause. Plenty of columns are of no use to us, so I should probably drop them to make working with the data more effective. Finally, the data spans 2020-01-01 to 2021-07-17, and the date column should be converted to datetime.
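The two follow-ups mentioned above (parsing the dates and locating the negative values) could look roughly like the sketch below. It runs on a tiny made-up frame, not on the real dataset; only the column names `date`, `new_cases` and `new_deaths` come from the notebook.

```python
import pandas as pd

# Tiny made-up frame that mimics the relevant columns (values are illustrative only).
sample = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03"],
    "new_cases": [5.0, -2.0, 7.0],
    "new_deaths": [0.0, 1.0, -1.0],
})

# Parse the string dates into real datetimes so that sorting and date axes behave properly.
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

# Count negative entries per column; NaN compares as False, so missing values are not counted.
negative_counts = (sample[["new_cases", "new_deaths"]] < 0).sum()
print(negative_counts)
```

On the real data the same two lines would be applied to `df`; the negative rows can then be inspected with a boolean mask such as `df[df.new_cases < 0]`.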

#### Stringency index

The nine metrics used to calculate the Stringency Index are: school closures; workplace closures; cancellation of public events; restrictions on public gatherings; closures of public transport; stay-at-home requirements; public information campaigns; restrictions on internal movements; and international travel controls.

A higher score indicates a stricter response (i.e. 100 = strictest response). If policies vary at the subnational level, the index is shown as the response level of the strictest sub-region.

It’s important to note that this index simply records the strictness of government policies. It does not measure or imply the appropriateness or effectiveness of a country’s response. A higher score does not necessarily mean that a country’s response is ‘better’ than others lower on the index.

# Cleaning up the dataset

In [11]:
df.columns

Index(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients',
       'icu_patients_per_million', 'hosp_patients',
       'hosp_patients_per_million', 'weekly_icu_admissions',
       'weekly_icu_admissions_per_million', 'weekly_hosp_admissions',
       'weekly_hosp_admissions_per_million', 'new_tests', 'total_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
       'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'new_vaccinations',
       'new_vaccinations_smoothed', 'total_vaccinations_per_hun

In [12]:
cols_to_drop = ['continent', 'reproduction_rate', 'icu_patients', 'icu_patients_per_million', 'hosp_patients', 'hosp_patients_per_million',
               'weekly_icu_admissions_per_million', 'tests_units', 'extreme_poverty', 'cardiovasc_death_rate',
               'diabetes_prevalence', 'handwashing_facilities', 'life_expectancy', 'human_development_index', 'excess_mortality']

In [13]:
df.drop(columns=cols_to_drop, inplace=True)
df.columns

Index(['iso_code', 'location', 'date', 'total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'weekly_icu_admissions',
       'weekly_hosp_admissions', 'weekly_hosp_admissions_per_million',
       'new_tests', 'total_tests', 'total_tests_per_thousand',
       'new_tests_per_thousand', 'new_tests_smoothed',
       'new_tests_smoothed_per_thousand', 'positive_rate', 'tests_per_case',
       'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated',
       'new_vaccinations', 'new_vaccinations_smoothed',
       'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
       'people_fully_vaccinated_per_hundred',
       'new_vaccinations_smoothed_per_million', 'stringency_index',
       'population', 'population_den

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103143 entries, 0 to 103142
Data columns (total 45 columns):
 #   Column                                 Non-Null Count   Dtype  
---  ------                                 --------------   -----  
 0   iso_code                               103143 non-null  object 
 1   location                               103143 non-null  object 
 2   date                                   103143 non-null  object 
 3   total_cases                            99197 non-null   float64
 4   new_cases                              99194 non-null   float64
 5   new_cases_smoothed                     98184 non-null   float64
 6   total_deaths                           88953 non-null   float64
 7   new_deaths                             89109 non-null   float64
 8   new_deaths_smoothed                    98184 non-null   float64
 9   total_cases_per_million                98670 non-null   float64
 10  new_cases_per_million                  98667 non-null   

# Answering Questions

## 0. What is the total number of Covid cases, tests made and vaccinated people?

### Total number of covid cases

In [15]:
total_cases = df.groupby('iso_code')['total_cases'].last().sum()
total_cases

603408677.0

### Total number of tests made

In [16]:
total_tests = df.groupby('iso_code')['total_tests'].last().sum()
total_tests

2587356722.0

### Total number of vaccinations (doses administered)

In [17]:
total_vac = df.groupby('iso_code')['total_vaccinations'].last().sum()
total_vac

11322906681.0
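One caveat about the sums above: the largest `total_cases` value in `df.describe()` is about 190 million, while the sum over all ISO codes is about 603 million. A likely explanation is that the dataset also contains aggregate rows (world, continents) that get counted on top of the individual countries. The sketch below shows one way to exclude them, assuming that aggregate rows are exactly the ones with a missing `continent` (an assumption, and one that has to be applied before `continent` is dropped). The data here is made up.

```python
import numpy as np
import pandas as pd

# Made-up frame: two countries plus one aggregate row group without a continent.
sample = pd.DataFrame({
    "iso_code": ["LTU", "LTU", "JPN", "JPN", "OWID_WRL", "OWID_WRL"],
    "continent": ["Europe", "Europe", "Asia", "Asia", np.nan, np.nan],
    "date": ["2021-07-16", "2021-07-17"] * 3,
    "total_cases": [10.0, 12.0, 20.0, np.nan, 30.0, 32.0],
})

# Naive total: aggregates are summed together with the countries they contain.
naive_total = (
    sample.sort_values("date").groupby("iso_code")["total_cases"].last().sum()
)

# Assumption: rows with a missing continent are aggregates, so keep only the rest.
countries = sample[sample["continent"].notna()].sort_values("date")

# GroupBy.last() returns the last non-null value per group, so JPN still contributes 20.
country_total = countries.groupby("iso_code")["total_cases"].last().sum()

print(naive_total, country_total)
```

Sorting by date first makes sure that "last" really means "most recent", independent of the row order in the file.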

## 1. How dense is the population in every country?

In [18]:
df.groupby('iso_code')['population_density'].first()

iso_code
ABW    584.800
AFG     54.422
AGO     23.890
AIA        NaN
ALB    104.871
        ...   
WSM     69.413
YEM     53.508
ZAF     46.754
ZMB     22.995
ZWE     42.729
Name: population_density, Length: 231, dtype: float64

### Sanity check

In [19]:
df.loc[df.iso_code == 'ABW',['population_density']].head()

Unnamed: 0,population_density
4753,584.8
4754,584.8
4755,584.8
4756,584.8
4757,584.8


In [20]:
density_df = df.groupby('iso_code')[['population_density', 'location']].first().reset_index()
density_df.head()

Unnamed: 0,iso_code,population_density,location
0,ABW,584.8,Aruba
1,AFG,54.422,Afghanistan
2,AGO,23.89,Angola
3,AIA,,Anguilla
4,ALB,104.871,Albania


In [21]:
density_df.population_density.min(), density_df.population_density.max()

(0.137, 20546.766)

In [22]:
df.population_density.describe()

count    95774.000000
mean       388.720811
std       1810.480729
min          0.137000
25%         36.253000
50%         83.479000
75%        209.588000
max      20546.766000
Name: population_density, dtype: float64

In [None]:
fig = px.choropleth(density_df, locations='iso_code', color='population_density',
                   hover_name='location', projection='natural earth',
                   title='Population Density',
                   range_color=(0,500),
                   color_continuous_scale='greens'
                   )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### Observations

A few locations have a massive population density. These are mostly city-states or small autonomous regions, for example Monaco and Macao. By default, Plotly's map does not include those regions.
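Because the density distribution is heavily skewed (median around 83, maximum above 20000), an alternative to clipping the colour range could be to colour by a log-transformed column. This is only a sketch; the column name `log10_density` is made up here, while the density values are taken from the output above.

```python
import numpy as np
import pandas as pd

# A few rows copied from the density output above (AIA has no density value).
density_sample = pd.DataFrame({
    "iso_code": ["AGO", "ALB", "ABW", "AIA"],
    "population_density": [23.890, 104.871, 584.800, np.nan],
})

# log10 compresses the long right tail; NaN stays NaN.
density_sample["log10_density"] = np.log10(density_sample["population_density"])
print(density_sample)
```

The new column could then be passed as `color='log10_density'` to `px.choropleth` instead of relying on `range_color=(0,500)`.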

***TO DO:*** find a better GeoJSON 

## 2. What is the median age in every country and how does it compare to others?

In [23]:
age_df = df.groupby('iso_code')[['location', 'median_age']].first().reset_index()
age_df.head()

Unnamed: 0,iso_code,location,median_age
0,ABW,Aruba,41.2
1,AFG,Afghanistan,18.6
2,AGO,Angola,16.8
3,AIA,Anguilla,
4,ALB,Albania,38.0


In [24]:
age_df.describe()

Unnamed: 0,median_age
count,191.0
mean,30.303665
std,9.093852
min,15.1
25%,22.1
50%,29.6
75%,38.7
max,48.2


In [None]:
fig = px.choropleth(age_df, locations='iso_code', color='median_age',
                   hover_name='location', projection='orthographic',
                   title='Median Age',
                   range_color=(15,49),
                   color_continuous_scale='greens'
                   )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

## 3. How did Covid progress by total cases, new cases, total deaths and new deaths?

### Covid Progression by Total Cases

In [25]:
df.sort_values('date', inplace=True)

In [None]:
fig = px.choropleth(df.loc[:,['iso_code', 'total_cases', 'date']], locations='iso_code',
                    color='total_cases',
                    animation_frame='date', 
                    title='Total Cases of Covid19',
                    height=750,
                    color_continuous_scale='greens'
)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### Covid Progression by New Cases

In [None]:
fig = px.choropleth(df.loc[:,['iso_code', 'new_cases_smoothed', 'date']], locations='iso_code',
                    color='new_cases_smoothed',
                    animation_frame='date', 
                    title='New Cases of Covid19',
                    height=750,
                    color_continuous_scale='greens'
)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### Covid Progression by Total Deaths

In [None]:
fig = px.choropleth(df.loc[:,['iso_code', 'total_deaths', 'date']], locations='iso_code',
                    color='total_deaths',
                    animation_frame='date', 
                    title='Total Deaths of Covid19',
                    height=750,
                    color_continuous_scale='greens'
)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### Covid Progression by New Deaths

In [None]:
fig = px.choropleth(df.loc[:,['iso_code', 'new_deaths_smoothed', 'date']], locations='iso_code',
                    color='new_deaths_smoothed',
                    animation_frame='date', 
                    title='New Deaths of Covid19',
                    height=750,
                    color_continuous_scale='greens'
)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

## 4. How did the testing rate change during the pandemic?

In [None]:
fig = px.choropleth(df.loc[:,['iso_code', 'new_tests', 'date']], locations='iso_code',
                    color='new_tests',
                    animation_frame='date', 
                    title='New Tests of Covid19',
                    height=750,
                    color_continuous_scale='greens'
)

#fig.update(layout_coloraxis_showscale=False)
fig.show()

## 5. How did tests per case change in Lithuania, Japan and the USA?

In [None]:
df.sort_index(inplace=True)
df.head()

In [None]:
df.columns

In [None]:
ds = df.loc[df.iso_code.isin(['JPN', 'USA', 'LTU']), ['date', 'iso_code', 'tests_per_case']]

In [None]:
fig = px.line(ds, x='date', y='tests_per_case', color='iso_code')

fig.update_xaxes(
    dtick="M1",
    tickformat='%b\n%Y'
)

fig.show()

## 6. How did vaccination progress change?

In [None]:
vac = df.groupby('date')['new_vaccinations'].agg(['sum'])
vac.reset_index(inplace=True)

vac.rename(columns={'sum':'Vaccines'}, inplace=True)
vac = vac[vac.Vaccines != 0]
vac.head()
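The `!= 0` filter is needed because summing a group that contains only missing values gives 0. A variant that keeps "no data" distinct from "zero doses" is sketched below, using the `min_count` argument of the groupby `sum`. The frame is made up; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame: the first date has no vaccination data at all.
sample = pd.DataFrame({
    "date": ["2020-12-01", "2020-12-01", "2020-12-02", "2020-12-02"],
    "new_vaccinations": [np.nan, np.nan, 100.0, np.nan],
})

vac_sample = (
    sample.groupby("date")["new_vaccinations"]
    .sum(min_count=1)   # groups with no valid value become NaN instead of 0
    .dropna()           # drop dates without any reported vaccinations
    .rename("Vaccines")
    .reset_index()
)
print(vac_sample)
```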

In [None]:
fig = px.line(vac, x='date', y='Vaccines')

fig.update_xaxes(
    dtick="M1",
    tickformat='%b\n%Y'
)

fig.show()

## 7. How strictly did governments respond with restrictions?

In [None]:
fig = px.choropleth(df.loc[:,['iso_code', 'stringency_index', 'date', 'location']], locations='iso_code', color='stringency_index',
                   hover_name='location', projection='natural earth',
                   title='Stringency Index',
                   range_color=(0,100),
                   color_continuous_scale='greens'
                   )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
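The cell above passes every date for every country to a static map, so each country appears many times. One option is to reduce the data to a single value per country first, for example the most recent non-missing `stringency_index`. The sketch below uses made-up values and a hypothetical name `latest_stringency`.

```python
import numpy as np
import pandas as pd

# Made-up frame: Lithuania has no stringency value on the latest date.
sample = pd.DataFrame({
    "iso_code": ["LTU", "LTU", "JPN", "JPN"],
    "location": ["Lithuania", "Lithuania", "Japan", "Japan"],
    "date": ["2021-07-16", "2021-07-17", "2021-07-16", "2021-07-17"],
    "stringency_index": [40.0, np.nan, 50.0, 55.0],
})

latest_stringency = (
    sample.dropna(subset=["stringency_index"])  # keep only rows with a reported index
    .sort_values("date")                        # so that "last" means "most recent"
    .groupby("iso_code", as_index=False)
    .last()
)
print(latest_stringency)
```

The resulting frame has one row per `iso_code` and could be passed to `px.choropleth` in place of the full `df`. Another option would be to add `animation_frame='date'`, as in the earlier maps.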

## 8. Do countries with more smokers suffer more from Covid?

In [26]:
df.columns

Index(['iso_code', 'location', 'date', 'total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'weekly_icu_admissions',
       'weekly_hosp_admissions', 'weekly_hosp_admissions_per_million',
       'new_tests', 'total_tests', 'total_tests_per_thousand',
       'new_tests_per_thousand', 'new_tests_smoothed',
       'new_tests_smoothed_per_thousand', 'positive_rate', 'tests_per_case',
       'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated',
       'new_vaccinations', 'new_vaccinations_smoothed',
       'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
       'people_fully_vaccinated_per_hundred',
       'new_vaccinations_smoothed_per_million', 'stringency_index',
       'population', 'population_den

In [None]:
smokers = df.groupby('iso_code')[['female_smokers', 'male_smokers', 'total_cases_per_million']].last()

In [None]:
smokers = smokers.dropna().sort_values('total_cases_per_million')

In [None]:
smokers['total_smokers'] = smokers['female_smokers'] + smokers['male_smokers']

In [None]:
smokers.loc[:,['total_cases_per_million', 'total_smokers']].corr()

In [None]:
smokers.head(10), smokers.tail(10)

### Observations

Looking at the correlation between total_smokers and total_cases_per_million, we can see a weak correlation, which gives some small indication that countries with more smokers may have more Covid cases. That said, correlation does not establish causality, and looking at the top 10 and bottom 10 countries by Covid cases per million people, the percentage of people who smoke varies a little but is very similar.
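Since `female_smokers` and `male_smokers` are percentages within each sex, their sum is not itself a percentage of the population. It can help to look at the two columns separately as well. The sketch below does that on made-up numbers; the mean of the two columns is only a rough proxy for the overall share, assuming similar numbers of men and women.

```python
import pandas as pd

# Made-up values, for illustration only.
smokers_sample = pd.DataFrame({
    "female_smokers": [2.0, 10.0, 20.0, 25.0],
    "male_smokers": [40.0, 30.0, 35.0, 28.0],
    "total_cases_per_million": [500.0, 20000.0, 60000.0, 90000.0],
})

# Rough overall share (assumption: about as many men as women in each country).
smokers_sample["mean_smokers"] = smokers_sample[
    ["female_smokers", "male_smokers"]
].mean(axis=1)

# Correlation of each smoking column with cases per million.
corr_with_cases = (
    smokers_sample.corr()["total_cases_per_million"]
    .drop("total_cases_per_million")
)
print(corr_with_cases)
```

Correlation is unaffected by scaling, so the mean and the sum give the same coefficient; the mean is just easier to read as a percentage.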

## 9. How do gdp_per_capita and median age correlate with total cases?

In [30]:
gdp_age = df.groupby('iso_code')[['gdp_per_capita', 'median_age', 'total_cases_per_million']].last()
gdp_age = gdp_age.dropna().sort_values('total_cases_per_million')
gdp_age.head(10), gdp_age.tail(10)

(          gdp_per_capita  median_age  total_cases_per_million
 iso_code                                                     
 TZA             2683.304        17.7                    8.521
 FSM             3299.464        23.0                    8.694
 VUT             2921.909        23.1                   13.023
 WSM             6021.557        22.0                   15.120
 KIR             1981.132        23.2                   16.744
 SLB             2205.923        20.8                   29.117
 CHN            15308.712        38.7                   64.128
 NER              926.000        15.1                  230.226
 YEM             1479.147        20.3                  233.924
 TCD             1768.153        16.7                  302.206,
           gdp_per_capita  median_age  total_cases_per_million
 iso_code                                                     
 NLD            48472.545        43.2               106022.904
 SWE            46949.283        41.0               10

In [31]:
gdp_age.corr()

Unnamed: 0,gdp_per_capita,median_age,total_cases_per_million
gdp_per_capita,1.0,0.651889,0.449003
median_age,0.651889,1.0,0.602184
total_cases_per_million,0.449003,0.602184,1.0


### Observations

The Pearson correlation (the default of `DataFrame.corr()`) shows a medium to strong positive dependency between median age and total cases per million, and a weaker positive dependency between GDP per capita and total cases per million. Once again, correlation does not establish causality, but looking at the 10 best (fewest cases) and 10 worst (most cases) countries, we can see a tendency for a younger median age to go with a smaller number of cases.
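Note that `DataFrame.corr()` computes the Pearson correlation unless a method is given; a rank-based Spearman correlation has to be requested with `method='spearman'`. The sketch below compares the two on made-up data where cases grow monotonically but not linearly with age.

```python
import pandas as pd

# Made-up values: cases increase with age, but far from linearly.
sample = pd.DataFrame({
    "median_age": [15.0, 20.0, 30.0, 40.0, 45.0],
    "total_cases_per_million": [10.0, 100.0, 1000.0, 10000.0, 100000.0],
})

# Pearson measures linear association (this is the default method).
pearson = sample.corr(method="pearson").loc["median_age", "total_cases_per_million"]

# Spearman works on ranks, so any strictly monotonic relationship gives 1.
spearman = sample.corr(method="spearman").loc["median_age", "total_cases_per_million"]

print(pearson, spearman)
```

On the real data the equivalent call would be `gdp_age.corr(method='spearman')`. Given how skewed cases per million are, the rank-based value may differ noticeably from the Pearson one.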

# Final Thoughts

To begin with, I want to thank [Alexa](https://www.kaggle.com/saumya5679), who provided this wonderful [dataset](https://www.kaggle.com/saumya5679/covid19-timeline-analysis) and allowed me to practise EDA and data visualisation with Plotly.

1. As expected, cases rise until the summer of 2020, then start to drop until they reach a plateau for a few months, and around autumn (or fall if you are American) cases start to increase again.
2. The number of new tests increases in the middle of the pandemic and starts to drop in 2021, probably because people are finishing their vaccinations and gaining immunity, so fewer tests are needed.
3. Vaccinations ramp up tremendously at the start and then plateau, with regular spikes as people get their vaccines. The number of vaccinations should drop later on because most people will have had their vaccines and a smaller share of the population will remain unvaccinated.
4. As expected, countries with a younger median age have lower total cases per million people, with a correlation of about 0.6. GDP per capita has a lower correlation of around 0.45, and since it is positive, richer countries reported more cases per million, not fewer. So the median-age part of the assumption holds, while the GDP part does not.
