
## Influence of socio-economic factors on the mortality rate of the COVID-19 pandemic

In [53]:
# URL of the header image
url = 'https://www.nen.nl/media/db875ff6-3769-4f64-b75e-b2747a00bfa2_corona_covid19_GettyImages-1213090148_2.jpg'

from IPython.display import Image, display

# Create an Image instance from the URL
image = Image(url=url)

# Display the image
display(image)


### Introduction

The COVID-19 pandemic was one of the most significant global health crises of the 21st century, affecting countries worldwide. The virus not only strained healthcare systems but also disrupted economies and daily life on a large scale. Governments were forced to implement drastic measures such as lockdowns, travel restrictions, and widespread testing to control the spread of the virus. Despite these efforts, the severity and duration of the pandemic's effects varied significantly between countries. Some countries managed to control the outbreak and recover relatively quickly, while others faced prolonged challenges and higher casualty rates. Understanding why some countries suffered less or recovered more quickly is crucial for developing effective countermeasures for future pandemics.

This project aims to analyze the global impact of COVID-19 on health outcomes and how these outcomes were influenced by socioeconomic status. By examining datasets of COVID-19 case numbers, deaths, vaccination rates, and socioeconomic indicators such as GDP, we will explore how the pandemic has affected populations worldwide. Our analysis will focus on identifying patterns and correlations that can explain the diversity of casualty rates across different regions. 

By understanding the key factors that led to better outcomes in certain countries, strategies can be developed to ensure better preparation for future pandemics. This will be useful for building a more resilient global health system that is capable of protecting populations against future health crises. 


### Dataset and Preprocessing

The datasets that we use are the OWID COVID-19 dataset and a dataset of GDP per capita (PPP, in US dollars). The COVID-19 dataset contains COVID-19 statistics for every country for the years 2020-2024, with variables such as “total_deaths” and “total_cases”. The second dataset contains the GDP per capita per country per year, that is, the economic output per inhabitant expressed in US dollars. PPP stands for purchasing power parity: the figures are adjusted for differences in purchasing power between countries, which makes comparisons fairer. The aim of this dataset is to give a reliable overview of the economic strength of each country per year.
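
Combining the two datasets requires a shared key. The sketch below uses made-up rows, not real data, to show why merging on country codes can be safer than merging on country names. It assumes that the OWID file has an `iso_code` column in which aggregates carry an `OWID_` prefix, and that the GDP file has a `Country Code` column; both assumptions should be checked against the actual files.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are illustrative, not real data)
covid = pd.DataFrame({
    "iso_code": ["NLD", "RUS", "OWID_WRL"],  # assumed OWID column and prefix convention
    "location": ["Netherlands", "Russia", "World"],
    "total_deaths_per_million": [1200.0, 2100.0, 700.0],
})
gdp = pd.DataFrame({
    "Country Name": ["Netherlands", "Russian Federation", "World"],
    "Country Code": ["NLD", "RUS", "WLD"],
    "2021": [63000.0, 32000.0, 18000.0],
})

# Merging on names silently drops countries whose spelling differs between files
by_name = covid.merge(gdp, left_on="location", right_on="Country Name", how="inner")

# Merging on codes keeps them; aggregates are removed via their assumed OWID_ prefix
countries = covid[~covid["iso_code"].str.startswith("OWID_")]
by_code = countries.merge(gdp, left_on="iso_code", right_on="Country Code", how="inner")

print(sorted(by_name["location"]))
print(sorted(by_code["location"]))
```

In this toy example the name-based merge loses Russia and keeps the world aggregate, while the code-based merge keeps both countries and drops the aggregate.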

### Perspective 1:
**Countries with higher GDP and higher vaccination rates have managed the COVID-19 pandemic more effectively, resulting in lower mortality rates and better health outcomes despite high case numbers**

#### Argument 1:
**Higher GDP per capita allows for better health care systems and therefore better access to vaccines, resulting in a lower mortality rate**

To see whether vaccines have contributed to a lower mortality rate, we plotted GDP per capita against the number of people fully vaccinated per hundred. We also plotted excess mortality per million inhabitants against people fully vaccinated per hundred. Comparing these two graphs should give us insight into the effect of the vaccines and into whether countries with a higher GDP per capita truly had better access to them.

We have taken the vaccination rates for 2021 because the widespread distribution of COVID-19 vaccines started in 2021 (Rijksoverheid, 2020).

To measure the impact of the pandemic in terms of deaths, we use excess mortality. This measure compares the number of deaths during the COVID-19 pandemic with the number of deaths we would have expected had the pandemic not occurred. The expected number is estimated with a regression model fitted on mortality data from 2015-2019. The model accounts for seasonal variation and year-to-year trends in mortality.

The advantage of using excess mortality instead of the registered COVID-19 deaths is that, in addition to confirmed COVID-19 deaths, it also captures deaths that were not correctly diagnosed or reported. This makes it more comparable across countries, including those with weak health systems and less robust data collection.
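
To make the idea of excess mortality concrete, here is a minimal sketch on synthetic weekly data. It is not the model used to produce the dataset: it only assumes a linear yearly trend plus one seasonal term per week of the year, fitted on 2015-2019 with ordinary least squares, and every number in it is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic weekly deaths for 2015-2019 (baseline) and 2020 (pandemic year)
years = np.repeat(np.arange(2015, 2021), 52)
weeks = np.tile(np.arange(52), 6)
seasonal = 100 * np.cos(2 * np.pi * weeks / 52)  # winter peak
trend = 10 * (years - 2015)                      # slow yearly increase
deaths = 3000 + trend + seasonal + rng.normal(0, 20, years.size)
deaths[years == 2020] += 400                     # simulated pandemic effect

def design_matrix(years, weeks):
    """Linear yearly trend plus one indicator column per week of the year."""
    week_dummies = np.eye(52)[weeks]
    return np.column_stack([years - 2015, week_dummies])

# Fit the baseline model on the pre-pandemic years only
baseline = years <= 2019
coef, *_ = np.linalg.lstsq(
    design_matrix(years[baseline], weeks[baseline]), deaths[baseline], rcond=None
)

# Expected deaths in 2020 had the pre-pandemic pattern continued
expected_2020 = design_matrix(years[~baseline], weeks[~baseline]) @ coef
excess_2020 = deaths[~baseline].sum() - expected_2020.sum()
print(round(excess_2020))
```

The printed value should be close to the simulated effect of 400 extra deaths in each of the 52 weeks.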

In [84]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Load the data
df = pd.read_csv('owid-covid-data.csv')

# Convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Filter the data for the year 2021
df_2021 = df[df['date'].dt.year == 2021].copy()

# Function to fill NaN values with the last available non-NaN value per country
def fill_last_available(df, col):
    df[col] = df.groupby('location')[col].ffill()
    return df

# Fill NaN values for the relevant columns
df_2021 = fill_last_available(df_2021, 'excess_mortality_cumulative_per_million')
df_2021 = fill_last_available(df_2021, 'people_fully_vaccinated_per_hundred')

# Extract the last available data for each country in 2021
df_last_2021 = df_2021.groupby('location').last().reset_index()

# No further filling is needed for 'gdp_per_capita': groupby(...).last() already
# returns the last non-missing value of every column for each country

# Create the scatter plot for GDP vs vaccination
fig_gdp = px.scatter(
    df_last_2021,
    x="people_fully_vaccinated_per_hundred",
    y="gdp_per_capita",
    hover_name="location",
    trendline="ols",
    trendline_color_override='darkblue',
    title="Comparison of GDP per capita and people fully vaccinated per hundred by Country",
    labels={
        "gdp_per_capita": "GDP per capita",
        "people_fully_vaccinated_per_hundred": "People fully vaccinated per hundred"
    }
)

# Create the scatter plot for excess mortality vs vaccination
fig_mortality = px.scatter(
    df_last_2021,
    x="people_fully_vaccinated_per_hundred",
    y="excess_mortality_cumulative_per_million",
    hover_name="location",
    trendline="ols",
    trendline_color_override='darkblue',
    title="Comparison of excess mortality per million and people vaccinated per hundred by country",
    labels={
        "excess_mortality_cumulative_per_million": "Excess mortality per million",
        "people_fully_vaccinated_per_hundred": "People fully vaccinated per hundred"
    },
)

# Create one figure that holds the traces of both plots; the dropdown toggles their visibility
fig = go.Figure()

# Add GDP traces, initially visible
for trace in fig_gdp.data:
    fig.add_trace(trace)
    fig.data[-1].visible = True

# Add mortality traces, initially invisible
for trace in fig_mortality.data:
    fig.add_trace(trace)
    fig.data[-1].visible = False

# Set initial title and axis labels
fig.update_layout(
    title="Comparison of GDP per capita and people fully vaccinated per hundred by Country",
    xaxis_title="People fully vaccinated per hundred",
    yaxis_title="GDP per capita"
)

# Update layout to add dropdown menu
fig.update_layout(
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    args=[
                        {"visible": [True] * len(fig_gdp.data) + [False] * len(fig_mortality.data)},
                        {"title.text": "Comparison of GDP per capita and people fully vaccinated per hundred by Country",
                         "xaxis.title.text": "People fully vaccinated per hundred",
                         "yaxis.title.text": "GDP per capita"}
                    ],
                    label="GDP per capita",
                    method="update"
                ),
                dict(
                    args=[
                        {"visible": [False] * len(fig_gdp.data) + [True] * len(fig_mortality.data)},
                        {"title.text": "Comparison of excess mortality per million and people vaccinated per hundred by country",
                         "xaxis.title.text": "People fully vaccinated per hundred",
                         "yaxis.title.text": "Excess mortality per million"}
                    ],
                    label="Excess mortality",
                    method="update"
                )
            ]),
            direction="down",
            showactive=True
        )
    ]
)

# Add annotation
fig.add_annotation(text="Figure 1.1", xref="paper", yref="paper", x=1, y=-0.2, showarrow=False, align="center", font=dict(size=14))

# Show the figure
fig.show()


In [86]:


from IPython.display import display, HTML

df_renamed = df_last_2021.rename(columns={
    'gdp_per_capita': 'GDP per capita',
    'people_fully_vaccinated_per_hundred': 'People fully vaccinated per hundred',
    'excess_mortality_cumulative_per_million': 'Excess mortality cumulative per million'
})

# Correlation between vaccination rate and excess mortality
corr1 = df_renamed[['People fully vaccinated per hundred', 'Excess mortality cumulative per million']].corr()

# Correlation between vaccination rate and GDP per capita
corr2 = df_renamed[['People fully vaccinated per hundred', 'GDP per capita']].corr()

# Display the first correlation matrix with a larger caption
display(corr1.style.set_caption('<span style="font-size: 16px;">Correlation between People fully vaccinated per hundred and Excess mortality cumulative per million</span>'))
print("Figure 1.2")

# Display the second correlation matrix with a larger caption
display(corr2.style.set_caption('<span style="font-size: 16px;">Correlation between People fully vaccinated per hundred and GDP per capita</span>'))
print("Figure 1.3")

| | People fully vaccinated per hundred | Excess mortality cumulative per million |
|---|---|---|
| People fully vaccinated per hundred | 1.0 | -0.413027 |
| Excess mortality cumulative per million | -0.413027 | 1.0 |

Figure 1.2

| | People fully vaccinated per hundred | GDP per capita |
|---|---|---|
| People fully vaccinated per hundred | 1.0 | 0.632631 |
| GDP per capita | 0.632631 | 1.0 |

Figure 1.3


The excess mortality view of figure 1.1 shows the relationship between the number of fully vaccinated people per 100 inhabitants and excess deaths per million inhabitants. The correlation coefficient is approximately -0.4 (figure 1.2), and the trendline falls as the share of vaccinated people rises: countries with higher vaccination rates tend to have lower excess mortality.

The GDP per capita view of figure 1.1 shows the relationship between GDP per capita and, again, the number of people fully vaccinated. With a correlation coefficient of approximately 0.6 (figure 1.3), there is a moderately strong positive correlation between a country's income per inhabitant and the number of people vaccinated per 100. The trendline rises, meaning that countries with a higher GDP per capita tend to have more vaccinated people.
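
As a reading aid for these coefficients, the sketch below computes a Pearson correlation both with pandas and directly from its definition. The values are illustrative, not taken from the dataset. Squaring the coefficient gives the share of variance that a straight-line fit accounts for: roughly 0.40 for r ≈ 0.63 and roughly 0.17 for r ≈ -0.41.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from the dataset
toy = pd.DataFrame({
    "people_fully_vaccinated_per_hundred": [10.0, 25.0, 40.0, 60.0, 75.0, np.nan],
    "gdp_per_capita": [2000.0, 9000.0, 15000.0, 30000.0, 42000.0, 50000.0],
})

# pandas ignores rows with a missing value in either column when computing r
r_pandas = toy["people_fully_vaccinated_per_hundred"].corr(toy["gdp_per_capita"])

# Same quantity from the definition: covariance over the product of the spreads
pair = toy.dropna()
x = pair["people_fully_vaccinated_per_hundred"].to_numpy()
y = pair["gdp_per_capita"].to_numpy()
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

# r squared: share of variance accounted for by a straight-line fit
print(round(r_pandas, 3), round(r_manual, 3), round(r_manual ** 2, 3))
```

Because rows with missing values are dropped pairwise, the two correlation matrices above may be based on different sets of countries.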





#### Argument 2: 
**Countries with a lower GDP per capita have fewer resources to spend on testing and recording data, which makes it seem like they are performing better at handling the pandemic**

To show how GDP per capita relates to various factors, we use a parallel categories plot. This type of plot helps us understand how countries with a lower GDP per capita might appear to manage the pandemic more effectively because they have limited resources for testing and data recording. We have chosen to use data from 2021 because this is the year in which the majority of people began to receive vaccinations (Rijksoverheid, 2020).


In [56]:
import pandas as pd
import plotly.graph_objs as go

# Load COVID-19 and GDP data
covid_df = pd.read_csv('owid-covid-data.csv')
gdp_df = pd.read_csv('GDP-data.csv', skiprows=4)

# Filter COVID-19 data for 2021 and 2020 and exclude certain locations
covid_2021_df = covid_df[covid_df['date'].str.startswith('2021')]
covid_2020_df = covid_df[covid_df['date'].str.startswith('2020')]
exclude_locations = ['World', 'Upper middle income', 'Lower middle income', 'High income', 'Low income',
                     'European Union', 'North America', 'South America', 'Asia', 'Oceania', 'Africa']
covid_2021_df = covid_2021_df[~covid_2021_df['location'].isin(exclude_locations)]
covid_2020_df = covid_2020_df[~covid_2020_df['location'].isin(exclude_locations)]

# Define variables of interest
variables = ['total_deaths_per_million', 'total_cases_per_million', 'people_vaccinated_per_hundred', 
             'total_tests_per_thousand', 'excess_mortality_cumulative_per_million']

# Compute last values for each variable by location for 2021 and 2020
last_values_2021_dfs = {}
last_values_2020_dfs = {}
for var in variables:
    last_values_2021_dfs[var] = covid_2021_df.groupby('location').last()[var].reset_index()
    last_values_2020_dfs[var] = covid_2020_df.groupby('location').last()[var].reset_index()

# Merge COVID-19 variables for 2021 with GDP data
merged_df = last_values_2021_dfs[variables[0]]
for var in variables[1:]:
    merged_df = pd.merge(merged_df, last_values_2021_dfs[var], on='location', how='left')

# Rename columns for clarity
for var in variables:
    merged_df = merged_df.rename(columns={var: f'{var}_2021'})

# Merge last values of 2020
for var in variables:
    if var != 'people_vaccinated_per_hundred':  # Skip 'people_vaccinated_per_hundred'
        merged_df = pd.merge(merged_df, last_values_2020_dfs[var].rename(columns={var: f'{var}_2020'}), on='location', how='left')

# Adjust variables by subtracting 2020 values from 2021 values, except for 'people_vaccinated_per_hundred'
for var in variables:
    if var != 'people_vaccinated_per_hundred':
        merged_df[var] = merged_df[f'{var}_2021'] - merged_df[f'{var}_2020']
    else:
        merged_df[var] = merged_df[f'{var}_2021']

# Drop unnecessary columns
columns_to_drop = [f'{var}_2021' for var in variables if var != 'people_vaccinated_per_hundred'] + [f'{var}_2020' for var in variables if var != 'people_vaccinated_per_hundred']
merged_df = merged_df.drop(columns=columns_to_drop)

# Merge with GDP data
gdp_df = gdp_df.rename(columns={"Country Name": 'location'})
gdp_df = gdp_df[['location', "2021"]]
final_merged_df = pd.merge(merged_df, gdp_df, on='location', how='inner')
final_merged_df = final_merged_df.rename(columns={"2021": "GDP_2021"})

# Categorical binning for each variable
for var in variables + ['GDP_2021']:
    final_merged_df[f'{var}_category'] = pd.qcut(final_merged_df[var], q=3, labels=['low', 'medium', 'high'])

# Define category orders for each variable
category_orders = {
    'GDP_2021_category': ['low', 'medium', 'high'],
    'total_deaths_per_million_category': ['low', 'medium', 'high'],
    'total_cases_per_million_category': ['low', 'medium', 'high'],
    'people_vaccinated_per_hundred_category': ['low', 'medium', 'high'],
    'total_tests_per_thousand_category': ['low', 'medium', 'high'],
    'excess_mortality_cumulative_per_million_category': ['low', 'medium', 'high'] 
}

# Ensure all variables are treated as categorical
for var in variables + ['GDP_2021']:
    final_merged_df[f'{var}_category'] = final_merged_df[f'{var}_category'].astype('category').cat.add_categories('nan').fillna('nan')

# Define dimensions for Plotly Parcats
dimensions = [
    {'label': 'GDP 2021', 'values': final_merged_df['GDP_2021_category'], 'categoryorder': 'array', 'categoryarray': category_orders['GDP_2021_category']},
    {'label': 'Total Deaths', 'values': final_merged_df['total_deaths_per_million_category'], 'categoryorder': 'array', 'categoryarray': category_orders['total_deaths_per_million_category']},
    {'label': 'Total Cases', 'values': final_merged_df['total_cases_per_million_category'], 'categoryorder': 'array', 'categoryarray': category_orders['total_cases_per_million_category']},
    {'label': 'People Vaccinated', 'values': final_merged_df['people_vaccinated_per_hundred_category'], 'categoryorder': 'array', 'categoryarray': category_orders['people_vaccinated_per_hundred_category']},
    {'label': 'Total Tests', 'values': final_merged_df['total_tests_per_thousand_category'], 'categoryorder': 'array', 'categoryarray': category_orders['total_tests_per_thousand_category']},
    {'label': 'Excess Mortality', 'values': final_merged_df['excess_mortality_cumulative_per_million_category'], 'categoryorder': 'array', 'categoryarray': category_orders['excess_mortality_cumulative_per_million_category']}
]

# Create the Parcats figure
fig = go.Figure(data=[
    go.Parcats(
        dimensions=dimensions,
        line={'color': final_merged_df['GDP_2021_category'].cat.codes, 'colorscale': 'Viridis', 'showscale': False},
        hoverinfo='count+probability',
        arrangement='freeform'
    )
])

# Update layout and display the plot
fig.update_layout(
    title='Parallel Categories Plot of COVID-19 and GDP Data',
    height=600
)



fig.add_annotation(text="Figure 1.4", xref="paper", yref="paper", x=1, y=-0.2, showarrow=False, align="center", font=dict(size=14))

# Show the figure
fig.show()



In this parallel categories plot, we excluded aggregate entries from the location data, such as continents, income groups, and the world total. We then categorized each variable as low, medium, or high by dividing the countries into three equally sized bins. Total deaths, total cases, and excess mortality are measured per million inhabitants, people vaccinated per hundred, and total tests per thousand. Except for the vaccination rate, each variable is the increase during 2021 (the last 2021 value minus the last 2020 value).
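
The binning step can be illustrated on a small series. This is a sketch with made-up numbers; it mirrors the `pd.qcut` call with three bins and the separate "nan" category used in the notebook code.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, np.nan], name="total_tests_per_thousand")

# q=3 splits the non-missing values into three equally populated bins (terciles)
binned = pd.qcut(values, q=3, labels=["low", "medium", "high"])

# Missing values stay missing after binning, so give them their own category
binned = binned.cat.add_categories("nan").fillna("nan")

print(binned.value_counts().to_dict())
```

Note that the cut-offs are relative to the countries that have data: "low" means the lowest third of the reported values, not a fixed threshold.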

The plot above shows that countries with a low GDP tend to have lower values for total deaths per million and total cases per million, which creates the impression that they managed the pandemic more effectively than high GDP countries. However, these countries also tend to have lower vaccination rates than countries with a medium or high GDP. Furthermore, the plot shows that low GDP countries either have a low test rate or have no test data available, and data on excess mortality per million is also often unavailable for them. Based on the plot, the lower reported deaths and cases per million in low GDP countries may therefore not accurately reflect how well they handled the pandemic. Instead, these figures might be influenced by limited resources for testing and data recording. The lower vaccination rates and the lack of excess mortality data in low GDP countries suggest that the pandemic's true impact may be underreported there.
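
The claim about unavailable data can be quantified as the share of countries without a reported value in each GDP group. The sketch below uses made-up rows. The column names follow the notebook's `final_merged_df`, but the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; in the notebook these columns live in final_merged_df
toy = pd.DataFrame({
    "GDP_2021_category": ["low", "low", "low", "medium", "medium", "high", "high"],
    "total_tests_per_thousand": [np.nan, np.nan, 12.0, 80.0, np.nan, 900.0, 1500.0],
    "excess_mortality_cumulative_per_million": [np.nan, np.nan, np.nan, 2100.0, 1800.0, 900.0, 1200.0],
})

# Share of countries without a reported value, per GDP group
missing_share = (
    toy.drop(columns="GDP_2021_category")
    .isna()
    .groupby(toy["GDP_2021_category"])
    .mean()
    .reindex(["low", "medium", "high"])
)
print(missing_share)
```

A high missing share in the low GDP group would support the reading that low reported figures partly reflect gaps in measurement.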

For this reason, it is useful to check whether the trend we see globally also holds among countries with a relatively high GDP per capita. Therefore, we made two scatter plots of deaths per million against GDP per capita: one for European countries and one for the rest of the world. Europe generally has a higher GDP per capita than the rest of the world, which is why we compare the two groups.

In [87]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Read the datasets
CovidData = pd.read_csv('owid-covid-data.csv')
GDPdata = pd.read_csv('GDP-data.csv', skiprows=4)

# Filter COVID data for 2021 and get the last available data for each location
Covid_2021 = CovidData[CovidData['date'].str.startswith('2021')]
Covid_deaths_2021 = Covid_2021.groupby('location').last().reset_index()

# Preprocess GDP data
GDPdata = GDPdata.rename(columns={'Country Name': 'location'})

# Merge GDP data with COVID data
df = pd.merge(GDPdata, Covid_deaths_2021, on='location', how='inner')

# Filter out non-country entries
non_countries = ['World', 'Upper middle income', 'Lower middle income', 'High income', 'Low income', 
                 'European Union', 'North America', 'South America', 'Asia', 'Oceania', 'Africa']
df = df[~df['location'].isin(non_countries)]

# Create a column to mark European countries
df['is_europe'] = df['continent'] == 'Europe'

# Create the scatter plot for global countries
fig_global = px.scatter(
    df[~df['is_europe']],
    x="total_deaths_per_million",
    y="2021",
    hover_name="location",
    trendline="ols",
    title="Comparison of GDP and Deaths per Million by Country",
    labels={
        "2021": "GDP",
        "total_deaths_per_million": "Total Deaths per Million"
    }
)

# Create the scatter plot for European countries
fig_europe = px.scatter(
    df[df['is_europe']],
    x="total_deaths_per_million",
    y="2021",
    hover_name="location",
    trendline="ols",
    title="Comparison of GDP and Deaths per Million by Country (Europe)",
    labels={
        "2021": "GDP",
        "total_deaths_per_million": "Total Deaths per Million"
    }
)

# Create one figure that holds the traces of both plots; the dropdown toggles their visibility
fig = go.Figure()

# Add the global traces, initially visible
for trace in fig_global.data:
    fig.add_trace(trace)
    fig.data[-1].visible = True

# Add the European traces, initially invisible
for trace in fig_europe.data:
    fig.add_trace(trace)
    fig.data[-1].visible = False

# Update layout to add dropdown menu and axis titles
fig.update_layout(
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    args=[{"visible": [True] * len(fig_global.data) + [False] * len(fig_europe.data)}],
                    label="Global",
                    method="update"
                ),
                dict(
                    args=[{"visible": [False] * len(fig_global.data) + [True] * len(fig_europe.data)}],
                    label="Europe",
                    method="update"
                )
            ]),
            direction="down",
            showactive=True
        )
    ],
    title="Comparison of GDP per capita and deaths per million by country",
    xaxis_title="Total Deaths per Million",
    yaxis_title="GDP per Capita",
)

# Show the figure
fig.add_annotation(text="Figure 1.5", xref="paper", yref="paper", x=1, y=-0.2, showarrow=False, align="center", font=dict(size=14))

fig.show()


The global graph, which covers the countries outside Europe, shows no clear correlation between total deaths per million and GDP per capita. In the European graph, however, the trendline clearly slopes downward: there is a negative correlation between total deaths per million and GDP per capita. This suggests that the global relationship may be influenced by other factors in countries with a lower GDP per capita.
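
A trend within one group can differ from the trend in pooled data. The sketch below uses numbers deliberately constructed to show this effect; they are not real data. It computes the correlation once over all rows and once per continent.

```python
import pandas as pd

# Values constructed for illustration only
toy = pd.DataFrame({
    "continent": ["Europe"] * 4 + ["Africa"] * 4,
    "gdp_per_capita": [30000.0, 40000.0, 50000.0, 60000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "total_deaths_per_million": [3000.0, 2500.0, 1800.0, 1200.0, 100.0, 150.0, 90.0, 160.0],
})

# Correlation over all countries at once
overall_r = toy["gdp_per_capita"].corr(toy["total_deaths_per_million"])

# Correlation within each continent separately
per_group_r = {
    name: group["gdp_per_capita"].corr(group["total_deaths_per_million"])
    for name, group in toy.groupby("continent")
}
print(round(overall_r, 2), {k: round(v, 2) for k, v in per_group_r.items()})
```

In this toy example the pooled correlation is positive while the correlation within Europe is strongly negative. This is why looking at a more homogeneous group can reveal a relationship that pooled data hides.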

### Perspective 2: 
**Countries with a lower GDP per capita have not necessarily managed the COVID-19 pandemic worse.**

Although there might be some truth to the statement that money and resources helped lighten the burden of the COVID-19 pandemic, other factors might also be at play. For this reason, it is important that we explore other possible explanations for a lower mortality rate.

#### Argument 1: 
**Countries with a lower GDP per capita do not have more cases and deaths than countries with a high GDP per capita**

Different regions of the world generally have very different socio-economic circumstances. Mapping the countries against their respective mortality rates should give us insight into possible causes of higher and lower mortality rates. We chose to map the mortality rate for several years because we wanted to see whether the contrast between high and low mortality rates stayed the same throughout the pandemic or changed from year to year.

In [88]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Load data
CovidData = pd.read_csv('owid-covid-data.csv')
GDPdata = pd.read_csv('GDP-data.csv', skiprows=4)

# Filter the data for the years 2020, 2021, and 2022
years = ['2020', '2021', '2022']
CovidData['year'] = CovidData['date'].str[:4]
CovidData = CovidData[CovidData['year'].isin(years)]

# Preprocess GDP data
GDPdata = GDPdata.rename(columns={'Country Name': 'location'})

# Preprocess Covid data for each year
def preprocess_covid_data(year):
    Covid_year = CovidData[CovidData['year'] == year]
    Covid_deaths_year = Covid_year.groupby('location').last()['total_deaths_per_million'].reset_index()
    df_year = pd.merge(GDPdata, Covid_deaths_year, on='location', how='inner')
    df_year = df_year[~df_year['location'].isin([
        'World', 'Upper middle income', 'Lower middle income', 'High income', 
        'Low income', 'European Union', 'North America', 'South America', 
        'Asia', 'Oceania', 'Africa', 'Peru'
    ])]
    return df_year

# Get the cumulative deaths for each year
df_2020 = preprocess_covid_data('2020')
df_2021 = preprocess_covid_data('2021')
df_2022 = preprocess_covid_data('2022')

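# Calculate the yearly increase in deaths per million.
# Align on location (not on row position) and always subtract the cumulative
# total of the previous year, so keep the cumulative series before overwriting.
cum_2020 = df_2020.set_index('location')['total_deaths_per_million']
cum_2021 = df_2021.set_index('location')['total_deaths_per_million']
cum_2022 = df_2022.set_index('location')['total_deaths_per_million']
df_2021['total_deaths_per_million'] = (cum_2021 - cum_2020).reindex(df_2021['location']).to_numpy()
df_2022['total_deaths_per_million'] = (cum_2022 - cum_2021).reindex(df_2022['location']).to_numpy()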

# Create a function to generate the choropleth map for a specific year
def create_choropleth(df, year):
    fig = px.choropleth(
        df, 
        locations="Country Code",
        color="total_deaths_per_million",
        hover_name="location",
        color_continuous_scale=px.colors.sequential.Blues,
        range_color=(0, max_deaths_per_million),
        title=f"Deaths per million by Country ({year})"
    )
    fig.update_layout(
        geo=dict(
            showframe=False,
            showcoastlines=False,
            projection_type='equirectangular'
        ),
        height=600
    )
    return fig

# Get the maximum value of total_deaths_per_million for consistent color scaling
max_deaths_per_million = max(
    df_2020['total_deaths_per_million'].max(), 
    df_2021['total_deaths_per_million'].max(), 
    df_2022['total_deaths_per_million'].max()
)

# Generate choropleth maps for each year
fig_2020 = create_choropleth(df_2020, '2020')
fig_2021 = create_choropleth(df_2021, '2021')
fig_2022 = create_choropleth(df_2022, '2022')

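# Create a figure with all traces
fig = go.Figure(data=fig_2020.data + fig_2021.data + fig_2022.data)

# Show only the 2020 map at first, matching the first dropdown entry
for i, trace in enumerate(fig.data):
    trace.visible = (i == 0)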

# Update the layout to include dropdown buttons
fig.update_layout(
    updatemenus=[
        {
            'buttons': [
                {
                    'label': '2020',
                    'method': 'update',
                    'args': [{'visible': [True, False, False]}]
                },
                {
                    'label': '2021',
                    'method': 'update',
                    'args': [{'visible': [False, True, False]}]
                },
                {
                    'label': '2022',
                    'method': 'update',
                    'args': [{'visible': [False, False, True]}]
                }
            ],
            'direction': 'down',
            'showactive': True,
        }
    ],
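    # Reuse the shared color axis (Blues scale, fixed range) built by px.choropleth;
    # it is stored in the layout and is not carried over with the traces
    coloraxis=fig_2020.layout.coloraxis.to_plotly_json(),
    geo=dict(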
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular',
    ),
    height=600
)

# Show the figure
fig.show()

In the map above, Peru has been filtered out: its mortality rate was exceptionally high, which skewed the color scale and obscured the contrast between the other countries.

The map shows the deaths per million inhabitants in countries across the world. There are more deaths per million in the Americas and Europe than in Africa and Asia, and this holds for every recorded year of the pandemic. This suggests that some factor that did not change over those three years caused some countries to have lower and others to have higher mortality rates.
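
The yearly values on the maps are derived from cumulative totals. The sketch below uses made-up numbers to show one way of turning cumulative end-of-year totals into yearly increases while keeping each country aligned with itself, even when a country has no data for some year.

```python
import pandas as pd

# Illustrative cumulative totals at the end of each year (not real data)
cumulative = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "year": [2020, 2021, 2022, 2021, 2022],
    "total_deaths_per_million": [100.0, 350.0, 420.0, 80.0, 200.0],
})

# One row per country, one column per year: subtraction then aligns by country
wide = cumulative.pivot(index="location", columns="year", values="total_deaths_per_million")

# Difference along the year axis; the first year keeps its cumulative value
yearly = wide.diff(axis=1)
yearly[2020] = wide[2020]
print(yearly)
```

Country B has no 2020 total, so its 2021 increase stays missing instead of being computed from another country's row.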

#### Argument 2: 
**Lower-income countries have a lower median age, which results in a lower mortality rate.**

Demographic factors like age could play a significant role in COVID-19 mortality rates. One notable factor is the median age of the population in different countries. Lower-income countries often have a younger population compared to higher-income countries, and this difference could impact COVID-19 mortality rates.

In [89]:
import pandas as pd
import plotly.express as px

# Read the datasets
covid_df = pd.read_csv('owid-covid-data.csv')
gdp_df = pd.read_csv('GDP-data.csv', skiprows=4)

# Convert 'date' column to datetime
covid_df['date'] = pd.to_datetime(covid_df['date'])

# Extract data for 2021 and 2020
covid_2021_df = covid_df[covid_df['date'].dt.year == 2021]
covid_2020_df = covid_df[covid_df['date'].dt.year == 2020]

# Find the last available date for each location in 2021 and 2020
last_dates_2021 = covid_2021_df.groupby('location')['date'].idxmax()
last_dates_2020 = covid_2020_df.groupby('location')['date'].idxmax()

# Filter the dataframe to only include rows with the last available date for each location
covid_last_2021 = covid_2021_df.loc[last_dates_2021]
covid_last_2020 = covid_2020_df.loc[last_dates_2020]

# Select required columns
covid_deaths_2021 = covid_last_2021[['location', 'total_deaths_per_million']]
covid_deaths_2020 = covid_last_2020[['location', 'total_deaths_per_million']]

# Merge 2020 and 2021 data on location
covid_deaths = pd.merge(covid_deaths_2021, covid_deaths_2020, on='location', suffixes=('_2021', '_2020'))

# Calculate the difference in total deaths per million between 2021 and 2020
covid_deaths['total_deaths_per_million_2021'] = covid_deaths['total_deaths_per_million_2021'] - covid_deaths['total_deaths_per_million_2020']

# Preprocess GDP data
gdp_df = gdp_df.rename(columns={'Country Name': 'location'})

# Merge GDP data with COVID data
df = pd.merge(gdp_df, covid_deaths, on='location', how='inner')

# Add other required columns from the covid_last_2021 data
other_columns = covid_last_2021[['location', 'continent', 'median_age', 'people_vaccinated_per_hundred', 'total_tests_per_thousand']]
df = pd.merge(df, other_columns, on='location', how='inner')

# Handle NaN values in the '2021' (GDP per capita) column
df['2021'] = df['2021'].replace('..', float('nan')).astype(float)
df = df.dropna(subset=['2021', 'total_deaths_per_million_2021'])

# Create the scatter plot
fig = px.scatter(
    df,
    x="median_age",
    y="2021", 
    size="total_deaths_per_million_2021",
    hover_name="location",
    color='continent',
    title="Comparison of Deaths per Million, Median Age, and GDP per Capita by Country",
    labels={
        "median_age": "Median Age",
        "total_deaths_per_million_2021": "Total Deaths per Million (2021)",
        "2021": "GDP per Capita 2021"
    },
    size_max=60,  # Maximum size of the bubbles
    color_continuous_scale=px.colors.sequential.Blues,
    height=800  # Adjust the height of the figure
)

fig.add_annotation(
    text="Figure 2.2", 
    xref="paper", 
    yref="paper", 
    x=1,  # Adjust x position
    y=-0.1,  # Adjust y position
    showarrow=False, 
    align="center", 
    font=dict(size=14)
)

# Show the figure
fig.show()

The bubble chart combines three variables: the x-axis shows the median age, the y-axis shows GDP per capita, and the bubble size shows the deaths per million in 2021.

European countries generally have a high GDP per capita and median ages above 35, with varied death tolls. African nations, with a generally lower GDP per capita and median ages around 25-30, usually report low death tolls. North America shows a wide range of GDPs and median ages, with death tolls that do not fit a clear pattern. South America, despite diverse GDPs and median ages, has relatively high death tolls. Oceania's countries differ significantly in GDP per capita and median age but have low death tolls, which could be due to geographic isolation. Asia shows diversity in both GDP and median age with generally low death tolls, suggesting influences other than median age.

Overall, the graph shows a slight correlation between GDP per capita and median age, and also a slight correlation between median age and deaths per million.
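
The correlations read off the bubble chart could be checked numerically. Because GDP and mortality are strongly skewed, a rank-based (Spearman) correlation may be a more robust summary than the Pearson coefficient. The sketch below uses made-up values under the notebook's column names and computes it as the Pearson correlation of the ranks.

```python
import pandas as pd

# Illustrative values only; column names follow the notebook
toy = pd.DataFrame({
    "median_age": [18.0, 20.0, 28.0, 35.0, 42.0, 45.0],
    "2021": [1500.0, 2500.0, 9000.0, 20000.0, 45000.0, 40000.0],  # GDP per capita
    "total_deaths_per_million_2021": [40.0, 30.0, 600.0, 900.0, 1500.0, 700.0],
})

# Spearman correlation equals the Pearson correlation of the ranks, which makes
# it less sensitive to extreme values than the plain Pearson coefficient
spearman = toy.rank().corr(method="pearson")
print(spearman.round(2))
```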

#### Argument 3:
**Countries with a lower Human Development Index still managed the pandemic better than countries with a higher Human Development Index**

The Human Development Index (HDI) is a composite measure of human development; countries with a lower GDP per capita tend to have a lower HDI. We want to show that even though a country may be less developed than a richer country, it could still manage the pandemic more effectively despite not having the same resources. We visualise this with a bubble scatter plot that has the HDI on the x-axis, GDP per capita on the y-axis, and the deaths per million in 2021 as the bubble size.

In [94]:
import pandas as pd
import plotly.express as px

# Read the datasets
covid_df = pd.read_csv('owid-covid-data.csv')
gdp_df = pd.read_csv('GDP-data.csv', skiprows=4)

# Convert 'date' column to datetime
covid_df['date'] = pd.to_datetime(covid_df['date'])

# Extract data for 2021 and 2020
covid_2021_df = covid_df[covid_df['date'].dt.year == 2021]
covid_2020_df = covid_df[covid_df['date'].dt.year == 2020]

# Find the last available date for each location in 2021 and 2020
last_dates_2021 = covid_2021_df.groupby('location')['date'].idxmax()
last_dates_2020 = covid_2020_df.groupby('location')['date'].idxmax()

# Filter the dataframe to only include rows with the last available date for each location
covid_last_2021 = covid_2021_df.loc[last_dates_2021]
covid_last_2020 = covid_2020_df.loc[last_dates_2020]

# Select required columns
covid_deaths_2021 = covid_last_2021[['location', 'total_deaths_per_million']]
covid_deaths_2020 = covid_last_2020[['location', 'total_deaths_per_million']]

# Merge 2020 and 2021 data on location
covid_deaths = pd.merge(covid_deaths_2021, covid_deaths_2020, on='location', suffixes=('_2021', '_2020'))

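# Deaths per million that occurred during 2021: the cumulative total at the end
# of 2021 minus the cumulative total at the end of 2020. The result is stored
# back in the 'total_deaths_per_million_2021' column.
covid_deaths['total_deaths_per_million_2021'] = covid_deaths['total_deaths_per_million_2021'] - covid_deaths['total_deaths_per_million_2020']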

# Preprocess GDP data
gdp_df = gdp_df.rename(columns={'Country Name': 'location'})

# Merge GDP data with COVID data
df = pd.merge(gdp_df, covid_deaths, on='location', how='inner')

# Add other required columns from the covid_last_2021 data
other_columns = covid_last_2021[['location', 'continent', 'median_age', 'people_vaccinated_per_hundred', 'total_tests_per_thousand', 'human_development_index', 'population_density', 'icu_patients_per_million']]
df = pd.merge(df, other_columns, on='location', how='inner')

# Handle NaN values in the '2021' (GDP per capita) column
df['2021'] = df['2021'].replace('..', float('nan')).astype(float)
df = df.dropna(subset=['2021', 'total_deaths_per_million_2021'])

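# Bubble chart: HDI (x) against GDP per capita (y), with bubble size showing
# deaths per million during 2021. trendline='ols' requires statsmodels.
fig4 = px.scatter(
    df,
    x="human_development_index",
    y="2021",
    size="total_deaths_per_million_2021",
    hover_name="location",
    trendline='ols',
    trendline_color_override='darkblue',
    title="Human Development Index vs GDP per Capita, Sized by Deaths per Million (2021)",
    labels={
        "human_development_index": "Human Development Index",
        "2021": "GDP per capita (2021)",
        "total_deaths_per_million_2021": "Total Deaths per Million (2021)"
    },
    height=600
)
fig4.add_annotation(
    text="Figure 2.3",
    xref="paper", yref="paper",
    x=1, y=-0.2,
    showarrow=False,
    align="center",
    font=dict(size=14)
)

# Show the figure
fig4.show()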

As the graph shows, the bubbles on the left-hand side are almost too small to notice, while the bubbles on the right are considerably larger. This indicates that more developed countries generally reported higher mortality rates. The correlation between GDP per capita and the HDI is quite high, because income per capita is one of the components used to calculate the HDI (the index uses GNI per capita, which closely tracks GDP per capita). Less developed countries therefore report fewer deaths overall. One possible explanation, which this data alone cannot confirm, is that populations in countries with a less developed healthcare system have built up stronger immunity.
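
Because bubble sizes are hard to compare by eye, the same pattern could also be summarised per HDI tier. The sketch below is a minimal illustration and not part of the original analysis: it assumes the notebook's column names, uses the UNDP tier cut-offs (low below 0.550, medium 0.550-0.699, high 0.700-0.799, very high from 0.800), and runs on hypothetical toy rows. The helper name `deaths_by_hdi_tier` is made up for this example.

```python
import pandas as pd

# UNDP human development tiers; intervals are closed on the left, open on the right.
HDI_BINS = [0.0, 0.550, 0.700, 0.800, 1.0]
HDI_LABELS = ["low", "medium", "high", "very high"]


def deaths_by_hdi_tier(frame):
    """Count countries and take the median deaths per million within each HDI tier."""
    tiers = pd.cut(
        frame["human_development_index"],
        bins=HDI_BINS,
        labels=HDI_LABELS,
        right=False,
    )
    return (
        frame.assign(hdi_tier=tiers)
        .groupby("hdi_tier", observed=False)["total_deaths_per_million_2021"]
        .agg(["count", "median"])
    )


# Hypothetical toy rows, NOT real data.
toy = pd.DataFrame({
    "human_development_index": [0.45, 0.52, 0.61, 0.74, 0.78, 0.85, 0.93],
    "total_deaths_per_million_2021": [30.0, 50.0, 200.0, 700.0, 900.0, 1200.0, 1000.0],
})

summary = deaths_by_hdi_tier(toy)
print(summary)
```

The median is used instead of the mean so that a few countries with very high death tolls do not dominate a tier.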

### Summary

In summary, even though there are reasons to expect that countries with a higher GDP per capita performed better at keeping mortality rates low, the data does not always reflect that. Countries with a higher GDP per capita should have had better access to vaccines, and vaccines have been shown to lower mortality rates, yet these countries do not consistently outperform countries with a lower GDP per capita. This could be due to inconsistent data recording caused by a lack of resources, which occurs more often in countries with a lower GDP per capita. On the other hand, there could be legitimate reasons why countries with a lower GDP per capita sometimes performed better. For example, countries with a lower GDP per capita tend to have a lower median age, which leads to a lower mortality rate.
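
The median-age explanation could be examined by estimating the effect of GDP per capita while holding median age fixed. The following is a rough sketch under stated assumptions, not a result from this project: it fits an ordinary least-squares model with NumPy on synthetic rows that were deliberately constructed so that deaths depend on median age only. The function name `fit_deaths_model` and the toy values are hypothetical; the column names follow the notebook's `df`.

```python
import numpy as np
import pandas as pd


def fit_deaths_model(frame):
    """Least-squares fit: deaths ~ intercept + log10(GDP per capita) + median age."""
    X = np.column_stack([
        np.ones(len(frame)),
        np.log10(frame["2021"].to_numpy(dtype=float)),
        frame["median_age"].to_numpy(dtype=float),
    ])
    y = frame["total_deaths_per_million_2021"].to_numpy(dtype=float)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coefs, index=["intercept", "log10_gdp", "median_age"])


# Synthetic rows, NOT real data: deaths are built as 10 + 30 * median_age,
# so GDP has no effect of its own even though it rises with age overall.
toy = pd.DataFrame({
    "median_age": [18.0, 22.0, 27.0, 31.0, 38.0, 44.0],
    "2021": [1500.0, 9000.0, 4000.0, 30000.0, 20000.0, 60000.0],
})
toy["total_deaths_per_million_2021"] = 10.0 + 30.0 * toy["median_age"]

coefs = fit_deaths_model(toy)
print(coefs.round(3))
```

In this synthetic case the GDP coefficient comes out at approximately zero once median age is included, which is the pattern one would expect if age structure, rather than wealth, drove the difference. On the real data the coefficients would need to be interpreted with care, since reporting quality also varies with GDP.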



### Reflection

The feedback from our TAs was important for improving our data story. They helped us correct our visualizations, in particular by suggesting that we replace a 3D scatter plot that was difficult to interpret with a bubble chart. The resulting graph is much easier to read and interpret. The TA also commented on the variety of our visualizations: according to him, we initially used too many scatter plots. This advice led us to diversify our plot types, which improved the overall clarity of our data story.

Additionally, our TA pointed out that our second perspective was too similar to the first. This feedback made us rework the second perspective so that it contrasts more clearly with the first, which allowed us to present a broader range of views and a more comprehensive analysis. During the peer feedback session we received the same comment about the overuse of scatter plots, and our peers likewise advised us to use a greater variety of data visualizations.

Using the design guidelines discussed in the lectures, we critically inspected each visualization against the given criteria. We ensured that each visualization used the most suitable chart type to convey the data effectively, highlighted key data points and trends, provided the necessary context, and used scales that accurately represent the data. We also chose colors and shapes that create contrast, which makes the charts easier to read. In addition, we repeated similar elements, such as the color blue, to add consistency throughout the page. By following these guidelines we ensured that our visualizations are not only aesthetically pleasing but also functionally effective and easy to interpret.

Overall, all the feedback we received helped us a lot by diversifying our data visualizations and elevating our data analysis. This helped us to make a more comprehensive data story from multiple viewpoints.



### Work Distribution
We all worked together on most aspects of the project, as we met up on campus to collaborate on the assignment. We brainstormed ideas and chose the perspectives and arguments together. Each team member then took responsibility for writing out one of the arguments for the data story. We wrote parts of the introduction together in Google Docs, so that it reflected all of our perspectives and we were on the same page. We also reflected on the feedback from the TAs and our peers together and made sure that all suggestions were integrated into the final project. This teamwork helped us to produce a unified and well-rounded data story.

### References

The links for these datasets are:

- OWID Covid-19: https://ourworldindata.org/coronavirus#deaths-and-cases-our-data-source
- GDP per capita, PPP (current international $), World Bank Data: https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD


Rijksoverheid. (2020, December 1). Start corona-vaccinatie mogelijk begin januari. Rijksoverheid. https://www.rijksoverheid.nl/actueel/nieuws/2020/12/01/start-corona-vaccinatie-mogelijk-begin-januari