# BI Exam May 2025: COVID-19 Data

#### Created by Group 7 - Kamilla, Jeanette, Juvena

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.metrics as skm  # alias renamed: 'sm' is rebound to statsmodels.api further down
from scipy import stats
from scipy.spatial.distance import cdist
from sklearn import metrics
from sklearn import tree
from sklearn import model_selection
from sklearn import preprocessing as prep
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import explained_variance_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import silhouette_score
from pandas.plotting import scatter_matrix
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, explained_variance_score, mean_squared_error
import statsmodels.api as sm

# Set plot styles for better visualization
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")

# Data Preparation

### 1. Load the Data

Now that we have our tools ready, the next step is to load the COVID-19 dataset into Python so we can start analyzing it.

In this case, we’re working with a single dataset:

- **OWID COVID-19 Latest Data**: a CSV file that contains country-level information on cases, deaths, vaccinations, testing, and various socioeconomic indicators.

We'll use Pandas to read the CSV file and store it as a DataFrame. To make our code cleaner and reusable, we'll define a simple function that loads the data and performs some initial checks. This way, we can easily reload or replace the dataset if needed in future steps.

In [None]:
# File paths for the covid datasets. (dataset: last updated 2024-08-04)
dataset_covid = 'Data/owid-covid-latest.csv'

# Function to load a CSV file
def load_csv_to_dataframe(file_path):
    # Reads the CSV file into a pandas DataFrame
    df = pd.read_csv(file_path)
    return df

# Load datasets
print("..Loading COVID-19 dataset")
df_covid = load_csv_to_dataframe(dataset_covid)
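
The loader above simply wraps `pd.read_csv`. If the "initial checks" mentioned earlier are wanted, a slightly more defensive variant could look like the sketch below. The helper name `load_csv_checked`, its `required_columns` parameter and the tiny in-memory CSV are assumptions made for illustration; they are not part of the notebook or the OWID data.

```python
import io

import pandas as pd


def load_csv_checked(source, required_columns=()):
    """Hypothetical helper: read a CSV and run a few basic sanity checks."""
    df = pd.read_csv(source)
    # Fail early if the file has a header but no data rows
    if df.empty:
        raise ValueError("The CSV file contains no rows.")
    # Fail early if columns we rely on later are absent
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


# Tiny in-memory CSV with made-up values, standing in for the real file
sample_csv = io.StringIO(
    "iso_code,location,total_cases\n"
    "AAA,Country A,100\n"
    "BBB,Country B,250\n"
)
df_sample = load_csv_checked(sample_csv, required_columns=["iso_code", "location"])
print(df_sample.shape)
```

Raising an error at load time makes a renamed or missing column visible immediately, rather than as a `KeyError` several cells later.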

### 2. Explore the Data

After loading the dataset, we want to explore it to understand what kind of information it contains and how it's structured.

To do this, we can use several helpful Pandas attributes and functions such as `shape`, `dtypes`, `info()`, `head()`, `tail()`, `sample()`, `describe()` and `isnull().sum()`. These give us insights into the number of rows and columns, the data types of each column, a summary of the data, and any missing values.

This exploration is crucial as it helps us identify potential issues or areas that need further cleaning or transformation before we proceed with our analysis. 

In [None]:
# Check the shape of the DataFrame (rows, columns)
df_covid.shape

In [None]:
# Display the data type of each attribute (column) in the DataFrame
df_covid.dtypes

In [None]:
# Gives an overview of the DataFrame
df_covid.info()

In [None]:
# Display the first 5 rows of the DataFrame
df_covid.head()

In [None]:
# Display the last 5 rows of the DataFrame
df_covid.tail()

In [None]:
# Display a random sample of 5 rows from the DataFrame
df_covid.sample(5)

In [None]:
# Gives summary statistics for all numerical columns in the dataset
df_covid.describe()

##### **2.1 Summary of exploring the data**

After exploring the dataframe, we found that it contains a large number of columns, many of which are not useful for our analysis or modeling goals. While some columns provide valuable information (like total cases, deaths, and vaccination rates), others are either redundant, mostly empty, or irrelevant.

This highlights the need for a thorough data cleaning step to remove unnecessary columns, handle missing values, and focus only on the most relevant features for our machine learning tasks.

### 3. Clean the Data

After loading and exploring the data, we need to clean it to ensure that our analysis is accurate and meaningful. Data cleaning involves several steps, including: checking for missing values, removing duplicates, and converting data types.

We start by cleaning the full dataset, removing rows and columns that are not relevant for our further analysis, before we separate the data into more specific datasets.

In [None]:
# Check for missing values in the DataFrame
df_covid.isnull().sum()

The output above shows that many columns contain no values at all, so we will remove them to clean up the dataset.
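
Raw counts of missing values can be hard to read when there are many columns. As a rough sketch, the share of missing values per column makes fully empty columns stand out. The DataFrame `df_demo` and its values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up example data: one column is completely empty
df_demo = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "total_cases": [10.0, np.nan, 30.0, 40.0],
    "empty_metric": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first
missing_share = df_demo.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns with no values at all, and the DataFrame without them
fully_empty = missing_share[missing_share == 1.0].index.tolist()
df_trimmed = df_demo.dropna(axis=1, how="all")
print(fully_empty)
```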

In [None]:
# Before cleaning the data, we want to remove irrelevant OWID aggregate rows—such as those representing high-income, low-income, and other income groupings.
rows_to_remove = ["OWID_UMC", "OWID_WRL", "OWID_LMC", "OWID_LIC", "OWID_HIC"]
df_removed_rows = df_covid[~df_covid["iso_code"].isin(rows_to_remove)]

We are removing the 'low-income countries', 'lower-middle-income countries', 'upper-middle-income countries', 'high-income countries' and 'world' categories because they are too broad and lack specific country-level detail, making it difficult to draw meaningful conclusions without relying on assumptions.

In [None]:
# Checking if the above rows were removed
print(f"{df_covid.shape}")
print(f"Removed the {df_covid.shape[0] - df_removed_rows.shape[0]} OWID rows from the dataframe.")

In [None]:
# We drop all columns that contain no values at all, such as excess_mortality_cumulative_absolute and excess_mortality_cumulative.
df_covid_removed_columns = df_removed_rows.dropna(axis=1, how='all')

In [None]:
# Check whether the columns were removed
print(f"COVID dataframe shape after removing columns: {df_covid_removed_columns.shape}")
print(f"Removed {df_covid.shape[1] - df_covid_removed_columns.shape[1]} columns from the dataframe.")


#### 3.1 Separating the data into different datasets

Now we separate the continent-level, age-level and health-level data into their own DataFrames so that we can clean and process them independently from the country-level data. This allows us to apply different cleaning steps based on the nature of the data, since these subsets may have different structures or missing values compared to individual countries.

##### 3.1.1 Separating the age-level data

In [None]:
# Function to filter the DataFrame based on a list of values
def filter_dataframe(df, values, filter_type='rows', row_filter_column=None):
    if filter_type == 'rows':
        if row_filter_column is None:
            raise ValueError("Must specify 'row_filter_column' when filtering rows.")
        return df[df[row_filter_column].isin(values)]
    elif filter_type == 'columns':
        # Keep only columns present in df and in values list (avoid key error)
        columns_to_keep = [col for col in values if col in df.columns]
        return df[columns_to_keep]
    else:
        raise ValueError("filter_type must be either 'rows' or 'columns'")
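
To make the two modes of `filter_dataframe` concrete, here is a small usage sketch. The function body is repeated so the example runs on its own; `df_demo` and its values are made up for illustration.

```python
import pandas as pd


def filter_dataframe(df, values, filter_type='rows', row_filter_column=None):
    if filter_type == 'rows':
        if row_filter_column is None:
            raise ValueError("Must specify 'row_filter_column' when filtering rows.")
        return df[df[row_filter_column].isin(values)]
    elif filter_type == 'columns':
        # Keep only columns present in df and in values list (avoid key error)
        columns_to_keep = [col for col in values if col in df.columns]
        return df[columns_to_keep]
    else:
        raise ValueError("filter_type must be either 'rows' or 'columns'")


# Made-up example data
df_demo = pd.DataFrame({
    "continent": ["Europe", "Asia", "Europe"],
    "location": ["A", "B", "C"],
    "median_age": [41.0, 29.5, 44.2],
})

# Row mode: keep rows whose 'continent' is in the list
europe_rows = filter_dataframe(
    df_demo, ["Europe"], filter_type='rows', row_filter_column="continent"
)

# Column mode: unknown column names are silently skipped
age_columns = filter_dataframe(
    df_demo, ["location", "median_age", "not_a_column"], filter_type='columns'
)
print(europe_rows)
print(age_columns)
```

Note that column mode ignores names that do not exist, which avoids a `KeyError` but can also hide a misspelled column name.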

In [None]:
# We are using the function above to separate the age-level columns from the rest of the data.
columns_to_secure = ["continent", "location", "total_deaths_per_million", "median_age", "aged_65_older", "aged_70_older", "life_expectancy"]
df_age = filter_dataframe(df_covid_removed_columns, columns_to_secure, filter_type='columns')

In [None]:
# Check that the columns were selected
df_age

We now have a new separate DataFrame called `df_age` that contains the age-level data. This DataFrame will be used for further analysis and modeling, while the original `df_covid` DataFrame will focus on country-level data.

In [None]:
# Check for missing values in the DataFrame
df_age.isnull().sum()

In [None]:
# Drop all the rows with NaN values in the 'median_age' column
df_age_cleaned = df_age.dropna(subset=['median_age']).copy()  # copy so later imputation works on an independent DataFrame

In [None]:
# Do another check for missing values in the DataFrame
df_age_cleaned.isnull().sum()

There are still some missing values in the `df_age_cleaned` DataFrame, so we will impute them to ensure that our analysis is accurate and meaningful. This step is important because missing values can lead to biased results or errors in our models.
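
A short sketch of why the median is often a reasonable choice for imputation here: with skewed data, a single extreme value pulls the mean far more than the median. The numbers below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values with one missing entry
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])

mean_value = values.mean()      # pulled upwards by the extreme value
median_value = values.median()  # stays near the bulk of the data

# Fill the gap with the median
filled = values.fillna(median_value)
print(mean_value, median_value)
print(filled.tolist())
```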

In [None]:
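# Helper that replaces missing values in a column with the column median
def fill_na_with_median(df, column_name):
    median_value = df[column_name].median()
    print(f"Median of '{column_name}': {median_value:.2f}")
    # Assign the result back instead of calling fillna(inplace=True) on a column
    # selection, which newer pandas versions warn about and may not apply
    df[column_name] = df[column_name].fillna(median_value)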

In [None]:
# Fill NaN values with the median for the columns total_deaths_per_million, aged_65_older and aged_70_older
fill_na_with_median(df_age_cleaned, "total_deaths_per_million")
fill_na_with_median(df_age_cleaned, "aged_65_older")
fill_na_with_median(df_age_cleaned, "aged_70_older")
df_age_cleaned

In [None]:
# Check for duplicates in the DataFrame
df_age_cleaned.duplicated().sum()

##### 3.1.2 Separating the health-level data

In [None]:
# We are using the same function as above to separate the health-level columns from the rest of the data.
columns_to_secure = ["continent", "location", "total_deaths_per_million", "cardiovasc_death_rate", "diabetes_prevalence", "female_smokers", "male_smokers", "life_expectancy"]
df_health = filter_dataframe(df_covid_removed_columns, columns_to_secure, filter_type='columns')

In [None]:
# Check that the columns were selected
df_health

We now have a new separate DataFrame called `df_health` that contains health-level data. This DataFrame will be used for further analysis and modeling, while the original `df_covid` DataFrame will focus on country-level data.

In [None]:
# Check for missing values in the DataFrame
df_health.isnull().sum()

In [None]:
# Drop all the rows with NaN values in the 'female_smokers' column
df_health_cleaned = df_health.dropna(subset=['female_smokers']).copy()  # copy so later imputation works on an independent DataFrame

In [None]:
# Do another check for missing values in the DataFrame
df_health_cleaned.isnull().sum()

There are still some missing values in the `df_health_cleaned` DataFrame, so we will impute them to ensure that our analysis is accurate and meaningful. This step is important because missing values can lead to biased results or errors in our models.

In [None]:
# Fill NaN values with the median for the columns cardiovasc_death_rate and male_smokers
fill_na_with_median(df_health_cleaned, "cardiovasc_death_rate")
fill_na_with_median(df_health_cleaned, "male_smokers")
df_health_cleaned

In [None]:
# Check for duplicates in the DataFrame
df_health_cleaned.duplicated().sum()

#### 3.2 Further cleaning of the country-level data

We have selected a subset of columns that we consider relevant for our analysis. This subset includes columns that provide information on total cases, deaths and population. By focusing on these columns, we can simplify our analysis and make it easier to draw meaningful conclusions.

In [None]:
# We make a new dataframe with the columns we want to keep for future analysis.
columns_we_want_to_keep = [
    "iso_code", "continent", "location", "total_cases", "total_deaths",
    "total_cases_per_million", "total_deaths_per_million",
    "life_expectancy", "population"]

# Removes all other columns
df_covid = df_covid_removed_columns[columns_we_want_to_keep]

In [None]:
# Check if the columns were removed
df_covid.info()

We then load another dataset so we can add data about the Human Development Index (HDI) for each country. The HDI is a composite index of life expectancy, education, and per capita income indicators, which are used to rank countries into four tiers of human development. This additional information will help us better understand the relationship between COVID-19 and various socioeconomic factors.

In [None]:
# We load the new dataset
hdi = pd.read_csv('Data/human-development-index.csv')

We only need the most recent year, so we filter the HDI data to 2021 and add it as a new column, matched to the countries in the COVID dataset.

In [None]:
# Filter HDI for 2021 only
hdi_2021 = hdi[hdi['Year'] == 2021]

# Merge using 'location' from df_covid and 'Entity' from hdi
df_merged = df_covid.merge(
    hdi_2021[['Entity', 'Human Development Index']], 
    left_on='location', 
    right_on='Entity', 
    how='left'
)

# Drop the extra 'Entity' column after merge, since we don't need it
df_merged = df_merged.drop(columns=['Entity'])

# Rename the column in the merged dataframe
df_merged = df_merged.rename(columns={'Human Development Index': 'human_development_index'})
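
Because the merge joins on country names, any spelling difference between the two sources leaves the HDI empty for that row. One way to list such rows is the `indicator` option of `merge`, sketched below. The two small DataFrames and their HDI values are placeholders, not real data.

```python
import pandas as pd

# Made-up stand-ins for df_covid and hdi_2021
covid_demo = pd.DataFrame({"location": ["Denmark", "Greenland", "Norway"]})
hdi_demo = pd.DataFrame({
    "Entity": ["Denmark", "Norway"],
    "Human Development Index": [0.9, 0.95],  # placeholder values
})

# indicator=True adds a '_merge' column telling where each row was found
merged = covid_demo.merge(
    hdi_demo,
    left_on="location",
    right_on="Entity",
    how="left",
    indicator=True,
)

# Rows present only in the left table found no HDI match
unmatched = merged.loc[merged["_merge"] == "left_only", "location"].tolist()
print(unmatched)
```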

In [None]:
# Check how the dataset looks and how we should proceed
df_merged

In [None]:
# Shape of the dataframe after some cleaning
print(f"COVID dataframe shape after removing both some columns and rows: {df_merged.shape}")

We are isolating the remaining rows in the df_covid DataFrame to ensure it contains only country-level data. This allows us to clean the dataset and retain only the features that are most relevant for our analysis.

In [None]:
# Since we separated the OWID continent fields into their own dataframe earlier, we now have to remove them from the df_covid dataframe.
rows_to_remove = ["OWID_AFR", "OWID_ASI", "OWID_EUR", "OWID_EUN", "OWID_NAM", "OWID_OCE", "OWID_SAM"]
df_covid_removed_rows = df_merged[~df_merged['iso_code'].isin(rows_to_remove)]
df_covid_cleaned = df_covid_removed_rows.dropna(subset=['iso_code'])
df_covid_cleaned = df_covid_cleaned.drop(columns=['iso_code'])
df_covid_cleaned        

In [None]:
# Check whether the rows were removed
print(f"COVID dataframe shape after removing rows: {df_covid_cleaned.shape}")
print(f"Removed {df_merged.shape[0] - df_covid_cleaned.shape[0]} rows from the dataframe.")
print(f"Removed {df_merged.shape[1] - df_covid_cleaned.shape[1]} column from the dataframe.")

In [None]:
# Check for missing values in the DataFrame
df_covid_cleaned.isnull().sum()

We can see a lot of missing values in the human_development_index column, so we will impute them with the HDI of the sovereign countries they belong to. This step is important because missing values can lead to biased results or errors in our models.

In [None]:
# Check which locations have missing HDI values
missing_hdi_locations = df_covid_cleaned[df_covid_cleaned['human_development_index'].isna()]
print(missing_hdi_locations['location'].unique())

In [None]:
territory_to_country = {
    'American Samoa': 'United States',
    'Anguilla': 'United Kingdom',
    'Aruba': 'Netherlands',
    'Bermuda': 'United Kingdom',
    'Bonaire Sint Eustatius and Saba': 'Netherlands',
    'British Virgin Islands': 'United Kingdom',
    'Cayman Islands': 'United Kingdom',
    'Cook Islands': 'New Zealand',
    'Curacao': 'Netherlands',
    'Falkland Islands': 'United Kingdom',
    'Faroe Islands': 'Denmark',
    'French Guiana': 'France',
    'French Polynesia': 'France',
    'Gibraltar': 'United Kingdom',
    'Greenland': 'Denmark',
    'Guadeloupe': 'France',
    'Guam': 'United States',
    'Guernsey': 'United Kingdom',
    'Isle of Man': 'United Kingdom',
    'Jersey': 'United Kingdom',
    'Kosovo': 'Serbia',  # or leave as is if Kosovo has its own HDI
    'Martinique': 'France',
    'Mayotte': 'France',
    'Monaco': 'France',
    'Montserrat': 'United Kingdom',
    'Nauru': 'Nauru',
    'New Caledonia': 'France',
    'Niue': 'New Zealand',
    'North Korea': 'North Korea',
    'Northern Mariana Islands': 'United States',
    'Pitcairn': 'United Kingdom',
    'Puerto Rico': 'United States',
    'Reunion': 'France',
    'Saint Barthelemy': 'France',
    'Saint Helena': 'United Kingdom',
    'Saint Martin (French part)': 'France',
    'Saint Pierre and Miquelon': 'France',
    'Sint Maarten (Dutch part)': 'Netherlands',
    'Somalia': 'Somalia',
    'Tokelau': 'New Zealand',
    'Turks and Caicos Islands': 'United Kingdom',
    'United States Virgin Islands': 'United States',
    'Vatican': 'Italy',
    'Wallis and Futuna': 'France'
}

In [None]:
# Map the territory to its sovereign country:
df_covid_cleaned['hdi_source_country'] = df_covid_cleaned['location'].map(territory_to_country)

In [None]:
# Create a lookup for HDI values of sovereign countries
hdi_lookup = df_covid_cleaned.set_index('location')['human_development_index'].to_dict()

In [None]:
# Fill missing HDI values with the sovereign country’s HDI
df_covid_cleaned['human_development_index'] = df_covid_cleaned.apply(
    lambda row: hdi_lookup.get(row['hdi_source_country'], row['human_development_index']) 
    if pd.isna(row['human_development_index']) else row['human_development_index'],
    axis=1
)

df_covid_cleaned.drop(columns=['hdi_source_country'], inplace=True)
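
The same fill can also be written without a row-wise `apply`, using two `map` calls and `fillna`. This is only a sketch of an alternative on invented data: `df_demo`, `territory_to_country_demo` and the HDI values are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up example: two territories without an HDI value
df_demo = pd.DataFrame({
    "location": ["Denmark", "Greenland", "France", "Mayotte"],
    "human_development_index": [0.9, np.nan, 0.8, np.nan],  # placeholders
})
territory_to_country_demo = {"Greenland": "Denmark", "Mayotte": "France"}

# Lookup of HDI by location name
hdi_lookup_demo = df_demo.set_index("location")["human_development_index"].to_dict()

# Territory -> sovereign country -> that country's HDI (NaN where no mapping exists)
sovereign = df_demo["location"].map(territory_to_country_demo)
borrowed_hdi = sovereign.map(hdi_lookup_demo)

# Only the missing entries are replaced; existing values are kept
df_demo["human_development_index"] = df_demo["human_development_index"].fillna(borrowed_hdi)
print(df_demo)
```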

In [None]:
# Check for missing values in the DataFrame
df_covid_cleaned.isnull().sum()

In [None]:
# Check which locations have missing HDI values
missing_hdi_locations = df_covid_cleaned[df_covid_cleaned['human_development_index'].isna()]
print(missing_hdi_locations['location'].unique())

We have found data for the human_development_index for 2021 for Nauru and Somalia on the https://hdr.undp.org/ website. We will use this data to fill in the missing values for these two countries in our dataset. We weren't able to find data on the human_development_index for North Korea, so we will impute it with the median value of the HDI for the other countries in the dataset. 

We will also impute the missing values for the columns total_cases, total_deaths, total_cases_per_million, total_deaths_per_million and life_expectancy with the median value.

In [None]:
# method for replacing cell with a value
def replace_cell(df, row_filter, column, value):
    df.loc[row_filter, column] = value

In [None]:
# Replace missing HDI values for Nauru and Somalia and impute North Korea with median value
replace_cell(df_covid_cleaned, df_covid_cleaned['location'] == 'Nauru', 'human_development_index', 0.692)
replace_cell(df_covid_cleaned, df_covid_cleaned['location'] == 'Somalia', 'human_development_index', 0.385)
fill_na_with_median(df_covid_cleaned, "human_development_index")
fill_na_with_median(df_covid_cleaned, "total_cases")
fill_na_with_median(df_covid_cleaned, "total_deaths")
fill_na_with_median(df_covid_cleaned, "total_cases_per_million")
fill_na_with_median(df_covid_cleaned, "total_deaths_per_million")
fill_na_with_median(df_covid_cleaned, "life_expectancy")
df_covid_cleaned


In [None]:
# Check for missing values in the DataFrame
df_covid_cleaned.isnull().sum()

In [None]:
# Check for duplicates in the DataFrame
df_covid_cleaned.duplicated().sum()

### 4. Hypothesis 1: Higher population size is associated with higher total COVID-19 deaths, but not necessarily with higher deaths per capita.


#### 4.1 Explore

***4.1.1 Descriptive Statistics***

In [None]:
# Check the shape of the DataFrame (rows, columns)
df_covid_cleaned.shape

In [None]:
# Gives an overview of the DataFrame
df_covid_cleaned.info()

In [None]:
df_covid_cleaned.dtypes

In [None]:
# Gives summary statistics for all numerical columns in the dataset
df_covid_cleaned.describe()

In [None]:
df_covid_cleaned.isnull().sum()

***4.1.2 Outliers***

Check for outliers in the df_covid_cleaned

In [None]:
# Check for outliers in covid dataset using IQR method
print("\n..Checking for outliers in the covid dataframe:")

# Loop through selected columns
for column in ['population', 'total_deaths_per_million', 'total_deaths']:
    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = df_covid_cleaned[column].quantile(0.25)
    Q3 = df_covid_cleaned[column].quantile(0.75)
    IQR = Q3 - Q1  # Interquartile Range

    # Define the lower and upper bounds for detecting outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Find rows where the value is outside the normal range
    outliers = df_covid_cleaned[
        (df_covid_cleaned[column] < lower_bound) | 
        (df_covid_cleaned[column] > upper_bound)
    ]

    # Print the number of outliers found for the column
    print(f"  {column}: {len(outliers)} outliers detected")


We used the IQR method to find outliers in the dataset for population, total deaths, and deaths per million. Outliers are values that lie far outside the typical range. This helps us spot extreme countries that might skew the results.
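
If the flagged rows themselves are needed for each column, the IQR rule can be wrapped in a small function and the results kept per column. The sketch below assumes a hypothetical helper `iqr_outliers` and made-up data; it applies the same 1.5 × IQR rule as the loop above.

```python
import pandas as pd


def iqr_outliers(df, column, k=1.5):
    """Hypothetical helper: rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = (df[column] < q1 - k * iqr) | (df[column] > q3 + k * iqr)
    return df[mask]


# Made-up data with one extreme value per numeric column
df_demo = pd.DataFrame({
    "location": list("ABCDEFG"),
    "population": [1, 2, 3, 4, 5, 6, 1000],
    "total_deaths": [10, 12, 11, 13, 12, 500, 11],
})

# One DataFrame of outliers per column, so none is overwritten
outliers_by_column = {
    col: iqr_outliers(df_demo, col) for col in ["population", "total_deaths"]
}
for col, rows in outliers_by_column.items():
    print(col, rows["location"].tolist())
```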

In [None]:
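# Note: after the loop, `outliers` only holds the rows flagged for the last column checked ('total_deaths')
print(outliers)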

***4.1.2.1 Conclusion from outliers***

We identified several outliers in the dataset, especially in the columns for population and total deaths. This indicates that some countries—such as the USA and India—have extreme values that differ significantly from the rest, supporting the hypothesis that larger populations are associated with higher total COVID-19 deaths.
In contrast, outliers in deaths per capita—like Belgium and Hungary—show that smaller countries can have disproportionately high death rates. This confirms that while population size influences the total number of deaths, it does not necessarily lead to higher deaths per person. Identifying these outliers ensures more reliable and balanced conclusions in our analysis.

We calculated deaths per capita to better compare COVID-19 impact across countries with different population sizes. While total deaths show how many people died, deaths per capita reveal how severely each country was affected relative to its population. This helps us identify smaller countries with high death rates that total numbers alone might hide.
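
The arithmetic behind a per-million rate is simple: total deaths divided by population, multiplied by one million. The OWID data already provides `total_deaths_per_million`, so the sketch below only illustrates the calculation on two invented countries.

```python
import pandas as pd

# Made-up example: a large and a small country
df_demo = pd.DataFrame({
    "location": ["Large", "Small"],
    "population": [100_000_000, 1_000_000],
    "total_deaths": [50_000, 2_000],
})

# Deaths per million inhabitants
df_demo["deaths_per_million"] = (
    df_demo["total_deaths"] * 1_000_000 / df_demo["population"]
)
print(df_demo)
```

In this toy example the large country has 25 times more deaths in total, yet the small country has the higher rate per million inhabitants.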

In [None]:
# Deaths per million
df_covid_cleaned.boxplot(column='total_deaths_per_million')
plt.title('Outliers in total deaths per million')
plt.show()

# Total deaths
df_covid_cleaned.boxplot(column='total_deaths') 
plt.title('Outliers in total deaths')
plt.show()

The boxplots show that total deaths have many extreme outliers, likely from large countries with high populations. In contrast, deaths per million have fewer and more moderate outliers, suggesting that while total deaths vary greatly, the death rate per person is more stable. This supports the hypothesis that population size is linked to total deaths but not necessarily to deaths per capita.

***4.1.3 Scatterplots***

In [None]:
# Deaths per million vs population
sns.scatterplot(data=df_covid_cleaned, x='population', y='total_deaths_per_million')
plt.title('Population vs COVID-19 total deaths per million')
plt.xlabel('Population')
plt.ylabel('Total Deaths per million')
plt.show()

# Total deaths vs population
sns.scatterplot(data=df_covid_cleaned, x='population', y='total_deaths')  
plt.title('Population vs COVID-19 total deaths')
plt.xlabel('Population')
plt.ylabel('Total deaths')
plt.show()


The scatterplots show a clear positive trend between population size and total COVID-19 deaths, suggesting that countries with larger populations tend to report more deaths overall. In contrast, there is no clear pattern between population size and deaths per million, supporting the idea that higher population size does not necessarily lead to a higher death rate per person. This aligns with the hypothesis.

***4.1.4 Scaling***

In [None]:
# get statistics (work on a copy so that adding scaled columns does not touch df_covid_cleaned)
scaled_data = df_covid_cleaned[['total_deaths_per_million']].copy()

print('Mean:', scaled_data['total_deaths_per_million'].mean())
print('Standard Deviation:', scaled_data['total_deaths_per_million'].std())

In [None]:
# draw histogram to visualize them
sns.histplot(scaled_data['total_deaths_per_million'], color='#ee4c2c', bins=50);
plt.ylabel("Number of countries")
plt.show()

This right-skewed distribution indicates that while some countries experienced extreme death rates, the majority had moderate impacts.

***4.1.5 Standard Scaling***

In [None]:
# reduce all with the mean and scale the data to unit variance
# x = (x-xmean)/std
standard_scaler = StandardScaler()
scaled_data['total_deaths_per_million'] = standard_scaler.fit_transform(scaled_data[['total_deaths_per_million']])

print('Mean:', scaled_data['total_deaths_per_million'].mean()) # almost 0
print('Standard Deviation:', scaled_data['total_deaths_per_million'].std()) # almost 1

In [None]:
# the histogram has the same shape, but the values are now centred around 0
plt.figure(figsize=(12, 4))
sns.histplot(scaled_data['total_deaths_per_million'], color='#ee4c2c', bins=50);
plt.ylabel("Number of countries")
plt.tight_layout()
plt.show()

The standardized histogram shows that most countries have death rates per capita below the global average, with a few countries having significantly higher values. This confirms that while total deaths vary, high death rates per capita are limited to only a small number of countries.
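
One detail behind the "almost 1" comment in the scaling cell: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below, on invented values, reproduces the scaler by hand and shows the small difference.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for total_deaths_per_million
x = pd.DataFrame({"total_deaths_per_million": [0.0, 100.0, 250.0, 400.0, 3000.0]})

# Scaling with scikit-learn
z_sklearn = StandardScaler().fit_transform(x).ravel()

# The same formula by hand: z = (x - mean) / std, with ddof=0
z_manual = ((x - x.mean()) / x.std(ddof=0)).to_numpy().ravel()

# pandas' default std uses ddof=1, so it is slightly above 1 for scaled data
std_ddof1 = pd.Series(z_sklearn).std()
std_ddof0 = pd.Series(z_sklearn).std(ddof=0)
print(np.allclose(z_sklearn, z_manual), std_ddof1, std_ddof0)
```

With n observations the pandas default gives sqrt(n / (n - 1)) for standardized data, which approaches 1 as n grows.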

***4.1.6 Min-Max Scaling - Normalization***

In [None]:
minmax_scaler = MinMaxScaler()
scaled_data['death_min_max_scaled'] = minmax_scaler.fit_transform(scaled_data[['total_deaths_per_million']])

print('Mean:', scaled_data['death_min_max_scaled'].mean())
print('Standard Deviation:', scaled_data['death_min_max_scaled'].std())

In [None]:
# values are in [0, 1]
sns.histplot(scaled_data['death_min_max_scaled'], color='#ee4c2c', bins=50);
plt.ylabel("Number of countries")
plt.show()

After applying Min-Max scaling, the histogram confirms that most countries have low COVID-19 death rates per capita, with only a few outliers having significantly higher values. This supports the hypothesis that high per capita death rates are rare and concentrated in specific countries.
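
In this notebook the Min-Max scaler is applied to a column that was already standardized. That does not change the result, because Min-Max scaling is unaffected by a prior shift and positive rescaling. A minimal check on invented values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up values standing in for total_deaths_per_million
raw = np.array([[0.0], [100.0], [250.0], [400.0], [3000.0]])

# Min-Max on the raw values and on the standardized values
standardized = StandardScaler().fit_transform(raw)
mm_from_raw = MinMaxScaler().fit_transform(raw)
mm_from_standardized = MinMaxScaler().fit_transform(standardized)

# The formula by hand: (x - min) / (max - min)
mm_manual = (raw - raw.min()) / (raw.max() - raw.min())

print(np.allclose(mm_from_raw, mm_from_standardized))
print(np.allclose(mm_from_raw, mm_manual))
```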

In [None]:
qtrans = QuantileTransformer()
scaled_data['death_trans_uniform'] = qtrans.fit_transform(scaled_data[['total_deaths_per_million']])

print('Mean:', scaled_data['death_trans_uniform'].mean())
print('Standard Deviation:', scaled_data['death_trans_uniform'].std())
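
`QuantileTransformer` defaults to `n_quantiles=1000`. With fewer rows than that, as is likely for a country-level table, scikit-learn is expected to warn and fall back to the number of samples. Setting the value explicitly avoids the warning. The sketch below uses randomly generated stand-in data, not the COVID dataset.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Randomly generated, right-skewed stand-in data (200 "countries")
rng = np.random.default_rng(0)
values = rng.exponential(scale=1000.0, size=(200, 1))

# Cap n_quantiles at the number of samples
qtrans_demo = QuantileTransformer(
    n_quantiles=min(len(values), 1000),
    output_distribution="uniform",
    random_state=0,
)
uniform_values = qtrans_demo.fit_transform(values)

# The transformed values are spread roughly evenly over [0, 1]
print(uniform_values.min(), uniform_values.max(), uniform_values.mean())
```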

***4.1.7 Boxplot for continents***

In [None]:
sns.boxplot(x='continent', y='total_deaths_per_million', data=df_covid_cleaned)
plt.title("Total deaths per capita by continent")
plt.xticks(rotation=45)
plt.show()

The boxplot shows that Europe has the highest variation and median in COVID-19 deaths per capita, suggesting some European countries were hit especially hard. In contrast, Africa has the lowest values and least variation. This supports the idea that death rates per person differ significantly across continents, with some countries in Europe standing out as high outliers.
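
The visual impression from a boxplot can be backed with numbers by computing the median and interquartile range per continent. This is a sketch on invented values; the helper `iqr` is defined here for illustration.

```python
import pandas as pd

# Made-up values for two continents
df_demo = pd.DataFrame({
    "continent": ["Europe", "Europe", "Europe", "Africa", "Africa", "Africa"],
    "total_deaths_per_million": [1000.0, 3000.0, 5000.0, 100.0, 200.0, 300.0],
})


def iqr(series):
    """Interquartile range: the height of the box in a boxplot."""
    return series.quantile(0.75) - series.quantile(0.25)


# Median and spread per continent
summary = df_demo.groupby("continent")["total_deaths_per_million"].agg(
    median="median", iqr=iqr
)
print(summary)
```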

***4.1.8 Correlation matrix***

In [None]:
# Select the relevant columns
corr_df = df_covid_cleaned[['population', 'total_deaths', 'total_deaths_per_million']]

# Compute the correlation matrix
corr_matrix = corr_df.corr()

# Plot heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

- Population and total deaths have a moderate positive correlation (0.46). This supports our hypothesis that countries with larger populations tend to have more total COVID-19 deaths.
- Population and deaths per million have a very weak negative correlation (-0.07). This means population size is not meaningfully related to deaths per capita.
- Total deaths and deaths per million have a weak positive correlation (0.27). This suggests some connection, but not a strong one.

***4.1.9 Pearson correlation***

Pearson helps quantify how strongly two numeric variables are linearly related, using a value between −1 and 1.
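
As a worked example of the definition, Pearson's r is the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below computes it by hand on five invented points and compares it with `scipy.stats.pearsonr`, which also returns a p-value.

```python
import numpy as np
from scipy import stats

# Five made-up data points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson's r from its definition
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# The same value from SciPy, together with a p-value
r_scipy, p_value = stats.pearsonr(x, y)
print(r_manual, r_scipy, p_value)
```

Here the deviation products sum to 8 and both sums of squares are 10, so r = 8 / 10 = 0.8.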

In [None]:
# Pearson correlation
print(df_covid_cleaned[['population', 'total_deaths', 'total_deaths_per_million']].corr())

The results show a moderate positive correlation between population and total deaths (0.46), but almost no correlation between population and deaths per million (−0.07). 

In [None]:
# Population vs Total Deaths
sns.lmplot(data=df_covid_cleaned, x='population', y='total_deaths')
plt.title('Regression: Population vs Total Deaths')
plt.show()

# Population vs Deaths per Capita
sns.lmplot(data=df_covid_cleaned, x='population', y='total_deaths_per_million')
plt.title('Regression: Population vs Deaths per million')
plt.show()

The regression plots and correlation matrix support the hypothesis. There is a moderate positive correlation (r = 0.46) between population size and total COVID-19 deaths, indicating that countries with larger populations tend to report more deaths overall. This is also reflected in the positive slope of the regression line.

In contrast, there is virtually no correlation (r = -0.07) between population size and deaths per million, suggesting that population size does not predict the death rate per person. The regression line in this case is nearly flat or slightly negative, with a wide confidence interval and scattered data points, showing a very weak or non-existent relationship.

#### 4.2 Data Modelling

We have chosen the dependent variable `total_deaths` and the independent variable `population` to test our hypothesis.

In [None]:
# Independent variable
X = df_covid_cleaned['population'].values.reshape(-1, 1)
# Dependent variable
y = df_covid_cleaned['total_deaths'].values.reshape(-1, 1)

In [None]:
# plot all
plt.ylabel('total_deaths')
plt.xlabel('population')
plt.scatter(X, y, color='blue')
plt.show()

In [None]:
# Independent variable
X2 = df_covid_cleaned['population'].values.reshape(-1, 1)
# Dependent variable
y2 = df_covid_cleaned['total_deaths_per_million'].values.reshape(-1, 1)

In [None]:
# Plot all
plt.ylabel('total_deaths_per_million')
plt.xlabel('population')
plt.scatter(X2, y2, color='blue')
plt.show()

The first scatterplot shows the relationship between population and total deaths. It reveals a positive trend, where countries with larger populations tend to report more total COVID-19 deaths.
The second scatterplot shows the relationship between population and total deaths per million. Unlike the first plot, this one shows no clear trend, indicating that higher population does not necessarily lead to a higher death rate per capita. 

In [None]:
df_covid_cleaned.plot.line(subplots=True)
plt.show()

In [None]:
sns.lmplot(x='population',y='total_deaths',data=df_covid_cleaned,fit_reg=True) 
plt.show()

In [None]:
sns.lmplot(x='population',y='total_deaths_per_million',data=df_covid_cleaned,fit_reg=True) 
plt.show()

***4.2.1 Conclusion – Data Modelling***

The goal of this analysis was not to predict COVID-19 deaths, but to examine whether there is a statistical relationship between population size and total COVID-19 deaths. The regression plots and a correlation coefficient of approximately 0.46 indicate a moderate positive linear relationship between population and total deaths. Conversely, the correlation between population size and deaths per million was close to zero (about -0.07), suggesting no clear relationship. This supports the hypothesis that countries with larger populations tend to have more total deaths, but not necessarily higher deaths per capita. The linear regression models fitted in the following cells are used to describe these relationships in the data, not to forecast future outcomes.


In [None]:
# Train-test split for the regression model for total deaths
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.15) 

# Train-test split for the regression model for total deaths per million
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, random_state=123, test_size=0.15) 

In [None]:
# The shape of the subsets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
print(X2_train.shape)
print(y2_train.shape)
print(X2_test.shape)
print(y2_test.shape)

***4.2.2 Train a ML Model***

In [None]:
# Creating an instance of Linear Regression model
myreg = LinearRegression()

# Fit it to our data for total_deaths
myreg.fit(X_train, y_train)
myreg

# Create and fit a second model for total_deaths_per_million
myreg2 = LinearRegression()
myreg2.fit(X2_train, y2_train)
myreg2

In [None]:
# Get the calculated coefficients
a = myreg.coef_
b = myreg.intercept_

a2 = myreg2.coef_
b2 = myreg2.intercept_

In [None]:
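# Print the calculated coefficients
# (coef_ has shape (1, 1) and intercept_ has shape (1,) because y was reshaped to 2-D)
print(f"Model 1 (total_deaths): y = a * x + b, with a = {a[0][0]} and b = {b[0]}")
print(f"Model 2 (total_deaths_per_million): y = a2 * x + b2, with a2 = {a2[0][0]} and b2 = {b2[0]}")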

***4.2.3 Test the Models***

In [None]:
y_predicted = myreg.predict(X_test)
print(f"Predicted values for first model: {y_predicted}")

y2_predicted = myreg2.predict(X2_test)
print(f"Predicted values for second model: {y2_predicted}")

In [None]:
# Visualise the Linear Regression 
plt.title('Linear Regression')
plt.scatter(X, y, color='green')
plt.plot(X_train, a*X_train + b, color='blue')
plt.plot(X_test, y_predicted, color='orange')
plt.xlabel('population')
plt.ylabel('total_deaths')
plt.show()

While there is a weak positive trend between population size and total deaths, there is almost no relationship between population and deaths per capita. This indicates that larger populations are associated with higher total deaths, but not necessarily higher deaths per person.

In [None]:
# Visualise the Linear Regression 
plt.title('Linear Regression')
plt.scatter(X2, y2, color='green')
plt.plot(X2_train, a2*X2_train + b2, color='blue')
plt.plot(X2_test, y2_predicted, color='orange')
plt.xlabel('population')
plt.ylabel('total_deaths_per_million')
plt.show()

Population vs. Total Deaths:
The first plot shows a weak positive linear relationship between population size and total COVID-19 deaths. The upward slope of the regression line suggests that countries with larger populations tend to have more total deaths. However, the data points are widely spread around the line, indicating a low R² value and a weak correlation. This means population size explains only a small part of the variation in total deaths.

Population vs. Deaths per Million (per capita):
The second plot shows a flat to slightly negative regression line, suggesting that there is no significant relationship between population size and deaths per million. The data points are highly scattered, and the regression line does not fit the data well. This supports the hypothesis that larger populations do not necessarily experience higher death rates per person.



In [None]:
# Predict total deaths for a population value of 170 (first model)
death_predicted = myreg.predict([[170]])
print(death_predicted)

# Predict deaths per million for a population value of 170 (second model)
death_predicted2 = myreg2.predict([[170]])
print(death_predicted2)

In [None]:
death_predict = a * 170 + b
print(death_predict)

death_predict2 = a2 * 170 + b2
print(death_predict2)
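
The manual formula and `predict` should give the same number. The sketch below checks this on synthetic data that lies exactly on a known line; the `demo_` names and values are assumptions for illustration only. It also passes the new value as a DataFrame with the training column name, which avoids the feature-name warning that scikit-learn raises when a model fitted on a DataFrame is given a plain list.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data that lies exactly on y = 2 * x + 1 (illustrative only)
demo_X = pd.DataFrame({"population": [1.0, 2.0, 3.0, 4.0, 5.0]})
demo_y = 2.0 * demo_X["population"] + 1.0

demo_reg = LinearRegression().fit(demo_X, demo_y)
slope = demo_reg.coef_[0]        # coef_ holds one entry per feature
intercept = demo_reg.intercept_  # a scalar when y is one-dimensional

# Predict for a single new value, passed with the same column name as in training
new_point = pd.DataFrame({"population": [170.0]})
via_predict = demo_reg.predict(new_point)[0]
via_formula = slope * 170.0 + intercept

print(via_predict, via_formula)
```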

***4.2.4 Model Evaluation***

We are going to evaluate the model with both population vs total deaths and population vs deaths per million.

In [None]:
# For model 1: Population vs Total Deaths
X = df_covid_cleaned[['population']]       
y = df_covid_cleaned['total_deaths']   

# For model 2: Population vs Total Deaths per million
X2 = df_covid_cleaned[['population']]       
y2 = df_covid_cleaned['total_deaths_per_million']      

# Create and train a model for population vs total deaths
model = LinearRegression()
model.fit(X, y)

# Create and train a model for population vs total deaths per million
model2 = LinearRegression()
model2.fit(X2, y2)

In [None]:
# Calculate the R-squared value for the first model
y_pred = model.predict(X)
r2 = r2_score(y, y_pred)
print(f"R-squared for the first model: {r2}")

# Calculate the R-squared value for the second model
y_pred2 = model2.predict(X2)
r2_model2 = r2_score(y2, y_pred2)
print(f"R-squared for the second model: {r2_model2}")

In [None]:
# MSE for first model
mse = metrics.mean_squared_error(y_test, y_predicted)
print(mse)

# MSE for second model
mse2 = metrics.mean_squared_error(y2_test, y2_predicted)
print(mse2)

***4.2.5 Calculate R-squared***

In [None]:
# Explained variance score: the proportion of the variance in a dependent variable that can be explained by the model
# 1 for perfect prediction
eV = round(explained_variance_score(y_test, y_predicted), 2)
print('Explained variance score for first model ',eV )

eV2 = round(explained_variance_score(y2_test, y2_predicted), 2)
print('Explained variance score for second model ',eV2 )

In [None]:
# R-squared: the proportion of the variation in the dependent variable that is predictable from the independent variable(s)
rscore = r2_score(y_test, y_predicted)
print('R-squared score for first model ', rscore)

rscore2 = r2_score(y2_test, y2_predicted)
print('R-squared score for second model', rscore2)
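
The two scores above often coincide, but they are not the same quantity: explained variance ignores a constant offset in the predictions, whereas R² penalizes it. A minimal sketch with made-up numbers follows; the helpers `explained_variance` and `r_squared` are hypothetical re-implementations of the standard formulas, written here only to show the difference.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def explained_variance(y_true, y_pred):
    """Explained variance: 1 - Var(residuals) / Var(y)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

y_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_shifted = y_demo + 1.0   # every prediction is off by the same amount

ev_demo = explained_variance(y_demo, y_shifted)
r2_demo = r_squared(y_demo, y_shifted)
print(ev_demo, r2_demo)
```

If the two scores differ clearly on the test set, the predictions are systematically too high or too low there.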

In [None]:
plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X, y_pred, color='red', label='Linear regression')
plt.xlabel('Population')
plt.ylabel('Total Deaths')
plt.title('Linear Regression: Population vs Total Deaths')
plt.legend()
plt.show()

The regression line between population and total deaths shows a very weak fit, with an R² below 0.005. This means population size alone explains less than 1% of the variation in total COVID-19 deaths across countries, suggesting that other factors play a much larger role.

In [None]:
plt.scatter(X2, y2, color='blue', label='Actual data')
plt.plot(X2, y_pred2, color='red', label='Linear regression')
plt.xlabel('Population')
plt.ylabel('Total Deaths per million')
plt.title('Linear Regression: Population vs Total Deaths per million')
plt.legend()
plt.show()

The second plot looks at the relationship between population size and deaths per million (per capita). The regression line goes upward due to a few outliers, but most data points are tightly grouped near the bottom.
This visual suggests no clear correlation between population size and per capita deaths. Countries with small populations can still have high death rates per person, and large populations do not guarantee a higher per capita impact.

**Conclusion**

The purpose of this analysis was not to build a predictive machine learning model, but to explore whether population size is associated with the impact of COVID-19, specifically total deaths and deaths per capita. The hypothesis stated that countries with larger populations would report more total deaths, but not necessarily higher deaths per capita.

The findings support this hypothesis. Regression plots, correlation analysis, and summary statistics show a moderate positive relationship between population size and total COVID-19 deaths. This suggests that more populous countries tend to have more deaths overall. However, when comparing population size with deaths per million, the correlation is close to zero, indicating no meaningful relationship. Some small countries showed high death rates per capita, which highlights that severity at the individual level can differ greatly regardless of population size.

R-squared values from the linear regression models were very low (close to 0), confirming that population alone cannot explain the variation in COVID-19 deaths across countries. Outlier detection also revealed that countries like the US and India strongly influence the total death figures, while smaller nations such as Belgium or Hungary stand out for high deaths per capita.

In conclusion, while population size helps explain total COVID-19 deaths to some extent, it does not explain how severely a country was impacted on a per-person basis. This analysis demonstrates the importance of comparing both total and per capita metrics when evaluating the global effects of the pandemic.
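
To illustrate why the two metrics can point in opposite directions, the sketch below computes deaths per million for two hypothetical countries. All numbers are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Two hypothetical countries (made-up numbers, for illustration only)
demo = pd.DataFrame({
    "location": ["Small country", "Large country"],
    "population": [2_000_000, 200_000_000],
    "total_deaths": [5_000, 100_000],
})

# Deaths per million = total deaths / population * 1,000,000
demo["total_deaths_per_million"] = (
    demo["total_deaths"] / demo["population"] * 1_000_000
)
print(demo)
```

The large country has twenty times as many deaths in total, yet the small country has the higher rate per million.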

---

### 5. Hypothesis 2: Countries with a higher Human Development Index (HDI) have experienced lower COVID-19 death rates per capita

We chose to investigate this hypothesis because HDI reflects key aspects of a country’s development, such as healthcare quality, education, and living standards.
It seems reasonable to assume that countries with higher HDI might be better equipped to handle a health crisis like COVID-19, potentially resulting in lower death rates.

 #### 5.1 Explore

In [None]:
df_covid_cleaned

In [None]:
df_covid_cleaned.info()

In [None]:
df_covid_cleaned.describe()


Now that we have explored the cleaned DataFrame, we can see that `df_covid_cleaned` contains a more manageable number of columns and rows than the original DataFrame. The columns we have retained are relevant for our analysis, and we have removed unnecessary or redundant features.

##### 5.1.1 Check for outliers in the df_covid_cleaned

The next step in exploring the data is checking for outlier values that are unusually high or low compared to the rest of the data.

We use the IQR (Interquartile Range) method, which is a common way to detect outliers:

- First, we calculate the first quartile (Q1) and third quartile (Q3) for each selected column.
- The IQR is the difference between Q3 and Q1.
- Any value that falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier.

We apply this method to the two important features regarding our hypotheses: total_deaths_per_million and human_development_index. This helps us find any unusual data points that could affect the results of our analysis.

In [None]:
# Check for outliers in the df_covid_cleaned dataframe using IQR method
print("\n..Checking for outliers in the df_covid_cleaned dataframe:")

# Loop through selected columns
for column in ['total_deaths_per_million', 'human_development_index']:
    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = df_covid_cleaned[column].quantile(0.25)
    Q3 = df_covid_cleaned[column].quantile(0.75)
    IQR = Q3 - Q1  # Interquartile Range

    # Define the lower and upper bounds for detecting outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Find rows where the value is outside the normal range
    outliers = df_covid_cleaned[
        (df_covid_cleaned[column] < lower_bound) | 
        (df_covid_cleaned[column] > upper_bound)
    ]

    # Print the number of outliers found for the column
    print(f"  {column}: {len(outliers)} outliers detected")
    print(outliers[['location', column]])
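
Since the same IQR check is applied to other columns later in the notebook, the logic could be wrapped in a small function. The sketch below is one possible form; `iqr_outliers` is a hypothetical helper and the five-row frame is made-up data chosen so the bounds are easy to verify by hand.

```python
import pandas as pd

def iqr_outliers(df, column, k=1.5):
    """Return the rows of df whose value in `column` lies outside
    [Q1 - k * IQR, Q3 + k * IQR]. Hypothetical helper, not part of the notebook."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lower) | (df[column] > upper)]

# Tiny made-up example: Q1 = 2, Q3 = 4, IQR = 2, so the bounds are -1 and 7
demo = pd.DataFrame({"location": list("ABCDE"), "value": [1, 2, 3, 4, 100]})
found = iqr_outliers(demo, "value")
print(found)
```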

##### 5.1.2 Conclusion of outliers: 
Seven outliers were detected in `total_deaths_per_million`: countries with a very high death toll per million compared to the other countries. They can affect averages and visualizations.

These outliers are likely not errors but extreme yet valid data points that reflect the real impact of COVID-19 in certain countries. For this reason, we have chosen to keep them. It is possible that these values provide valuable insight into how HDI relates to death rates per capita, and removing them might hide important patterns in the data.





##### 5.1.3 Visualize the impact of HDI on Covid-19 death rate

##### 5.1.3.1 Scatterplot

To explore whether a relationship exists between Human Development Index (HDI) and COVID-19 death rates per million, we use a scatterplot to visualize the distribution and potential correlation between the two variables.

In [None]:
sns.scatterplot(data=df_covid_cleaned, x='human_development_index', y='total_deaths_per_million')
plt.title('Human Development Index vs Total Deaths per Million')
plt.show()

The above scatterplot shows no clear negative correlation between HDI and COVID-19 death rates. High HDI countries vary widely in death rates, suggesting that HDI alone does not explain the differences. Other factors likely play a role.

Countries with low HDI values do not consistently show higher death rates either, reinforcing that HDI alone is not a strong predictor of COVID-19 mortality. 

##### 5.1.3.2 Correlation matrix

In [None]:
corr_matrix = df_covid_cleaned[['human_development_index', 'total_deaths_per_million']].corr()

In [None]:
sns.heatmap(corr_matrix, annot=True)

The correlation matrix above shows a moderate positive correlation (0.47) between Human Development Index and COVID-19 death rates per million. This is surprising, as our hypothesis expected a negative correlation — that higher HDI would be linked to lower death rates. The result suggests that, in this dataset, countries with higher HDI tend to report higher death rates per million. This indicates that HDI alone does not explain the differences, and other factors likely influence the outcomes.
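
For a single predictor, the squared Pearson correlation equals the R² of the least-squares line fitted to the same data, so a correlation of 0.47 corresponds to roughly 0.47² ≈ 0.22 of the variance. The sketch below checks this identity on synthetic data; the numbers are made up and are not taken from `df_covid_cleaned`.

```python
import numpy as np

# Synthetic data with a noisy linear trend (illustrative only)
rng = np.random.default_rng(42)
x = rng.uniform(0.4, 1.0, size=200)
y = 1500 * x + rng.normal(scale=600, size=200)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# Least-squares line and its R-squared on the same data
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(round(r ** 2, 6), round(r2, 6))  # the two values agree
```

An R² computed on a held-out test set does not have to match this in-sample value.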

 #### 5.2 Data Modelling

##### 5.2.1 Linear regression (Supervised Machine Learning)

To further investigate the relationship between Human Development Index (HDI) and COVID-19 death rates per million, we apply linear regression. This method helps assess the strength and direction of the relationship between these two variables and allows us to evaluate whether HDI can be used to predict COVID-19 mortality rates across countries.

In [None]:
# Choose dependent and independent variables

# independent
X = df_covid_cleaned[['human_development_index']]

# dependent
y = df_covid_cleaned[['total_deaths_per_million']]

In [None]:
# Splitting the dataset into training and testing sets

# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.20)

In [None]:
# the shape of the subsets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
# Creating an instance of Linear Regression model
myreg = LinearRegression()

In [None]:
# Fit it to our data
myreg.fit(X_train, y_train)
myreg

In [None]:
# Get the calculated coefficients
a = myreg.coef_
b = myreg.intercept_

a

In [None]:
b

In [None]:
print(f"The model is a line, y = a * x + b, or y = {a} * x + {b}")

In [None]:
y_predicted = myreg.predict(X_test)

In [None]:
# Visualise the Linear Regression 
plt.title('Linear Regression: HDI vs COVID-19 Deaths per Million')
plt.scatter(X, y, color='green')
plt.plot(X_train, a*X_train + b, color='blue')
plt.plot(X_test, y_predicted, color='orange')
plt.xlabel('HDI')
plt.ylabel('Deaths per Million')
plt.show()

The above graph visualizes the relationship between HDI and COVID-19 deaths per million using a linear regression model. Each green dot represents a country. The blue line is the regression line fitted on the training data, and the orange line shows the model’s predictions for the test set. While there appears to be a slight upward trend, the data points are spread out, especially at higher HDI values, suggesting that the relationship might not be very strong.

In [None]:
# Predict deaths per million from HDI
hdi_value = 0.85
prediction = myreg.predict([[hdi_value]])
print(f"Predicted death rate for HDI {hdi_value}:", prediction)

In [None]:
manual_prediction = a * hdi_value + b
print("Manual prediction:", manual_prediction)

In [None]:
# Mean Absolute Error (MAE) is the mean of the absolute value of the errors
print("MAE:", metrics.mean_absolute_error(y_test, y_predicted))

# Mean Squared Error (MSE) is the mean of the squared errors
print("MSE:", mean_squared_error(y_test, y_predicted))

# Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_predicted)))

# R-squared: the proportion of the variation in the dependent variable that is predictable from the independent variable(s)
print("R² score:", r2_score(y_test, y_predicted))
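
To make the three error metrics concrete, here is a hand calculation with NumPy on four made-up values (not results from the notebook). MAE and RMSE are in the same unit as the target, here deaths per million, and RMSE weights large errors more heavily than MAE.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # made-up observed values
y_pred = np.array([2.0, 5.0, 8.0, 11.0])  # made-up predictions

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # square root brings it back to the unit of y

print(mae, mse, rmse)
```

A gap between RMSE and MAE, as in this toy example, is a sign that a few large errors dominate.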

##### 5.2.1.1 Conclusion of linear regression
Based on the results, the linear regression model does not perform well.
The average error (MAE) is 774 and the root mean square error (RMSE) is over 1000, which means the predictions are far from the actual values.
The R² score is only 0.28, meaning that HDI explains just 28% of the differences in death rates between countries.
This suggests that HDI alone is not a good predictor of COVID-19 mortality, and that other factors likely play a more important role.

 #### 5.3 Additional Analysis - Nordic Comparison: HDI and Death Rates

Since our earlier results showed that HDI alone does not explain differences in COVID-19 death rates, we chose to examine the Nordic countries. These countries have very similar HDI levels and welfare systems, which makes them ideal for a focused comparison. This analysis helps test whether HDI has a consistent effect within a more uniform group and can either support or weaken our hypothesis.

In [None]:
# Filter the DataFrame for Nordic countries
nordic_countries = ['Denmark', 'Sweden', 'Norway', 'Finland', 'Iceland']
df_nordic = df_covid_cleaned[df_covid_cleaned['location'].isin(nordic_countries)]


In [None]:
# Select relevant columns 
df_nordic_subset = df_nordic[['location', 'human_development_index', 'total_deaths_per_million']]
df_nordic_subset

##### 5.3.1 Bar chart

In [None]:
# Visualize the death rates using a bar chart

plt.figure(figsize=(8, 5))
sns.barplot(data=df_nordic_subset, x='location', y='total_deaths_per_million')
plt.title('COVID-19 Deaths per Million – Nordic Countries')
plt.xlabel('Country')
plt.ylabel('Deaths per Million')
plt.show()

##### 5.3.2 Conclusion of Nordic comparison

Despite similar HDI levels among the Nordic countries, there is a clear variation in COVID-19 death rates per million. Sweden shows the highest rate, while Iceland has the lowest. This indicates that even within a region with high and comparable development, other factors beyond HDI may strongly influence COVID-19 mortality.

 #### 5.4 Conclusion of Hypothesis 2 

The analysis does not support the hypothesis that countries with a higher Human Development Index (HDI) have experienced lower COVID-19 death rates per capita. Instead of the expected negative relationship, the results show a moderate positive correlation (0.47), and the linear regression model performed poorly (R² = 0.28). Additionally, the Nordic comparison showed large differences in death rates despite very similar HDI values. This suggests that HDI alone is not sufficient to explain COVID-19 mortality differences.

Although HDI reflects general development such as healthcare, education, and living standards, it may not capture specific pandemic-related factors like healthcare system capacity or testing infrastructure. Other, more direct factors likely play a greater role.

----

### 6. Hypothesis 3: Countries with a higher life expectancy and older populations (e.g. higher median age, % aged 65+, etc.) have experienced higher COVID-19 death rates

 #### 6.1 Explore

The dependent variable is `total_deaths_per_million`, which represents the total number of COVID-19 deaths per million people in each country. This variable is crucial for understanding the impact of the pandemic on different populations and will be used to assess the relationship with independent variables such as `median_age`, `aged_65_older`, `aged_70_older` and `life_expectancy`. 

Note that not all countries are represented in the dataset, since countries with missing data on these variables were removed. The analysis therefore only includes countries for which we have complete data on these variables.

***6.1.1 Descriptive Statistics***

First we look at some of the descriptive statistics of the `df_age_cleaned` DataFrame to get an overview of the data. This includes the mean, median, standard deviation, and other statistics for each column.

In [None]:
# Gives summary statistics for all numerical columns in the dataset
df_age_cleaned.describe()

The data shows large variation across countries in both age-related factors and COVID-19 death rates. Median age ranges from 15 to 48 years, and life expectancy from 53 to nearly 85, indicating diverse population structures. COVID-19 deaths per million also vary widely, from 0 to over 6,600, with a high standard deviation—suggesting age and life expectancy could meaningfully relate to differences in death rates.

***6.1.2 Normality***

To test whether the numeric columns follow a normal distribution, we can use the D'Agostino and Jarque-Bera tests. These tests are designed to assess the skewness and kurtosis of the data, which are key indicators of normality.

In [None]:
# Function to test normality of numeric columns
def check_normality(df):
    num_cols = [col for col in df.select_dtypes(include=['float64', 'int64']).columns if col not in ['location', 'continent']]
    
    rows = []

    for col in num_cols:
        data = df[col]
        skewness = data.skew()
        kurtosis = data.kurt()
        dagostino = stats.normaltest(data)
        jb = stats.jarque_bera(data)

        normal = "No"
        if dagostino.pvalue > 0.05 and jb.pvalue > 0.05 and abs(skewness) < 1:
            normal = "Yes"
        elif dagostino.pvalue > 0.01 and abs(skewness) < 2:
            normal = "Partial"

        rows.append({
            'Column': col,
            'Skewness': round(skewness, 3),
            'Kurtosis': round(kurtosis, 3),
            "D'Agostino p-value": f"{dagostino.pvalue:.2e}",
            "Jarque-Bera p-value": f"{jb.pvalue:.2e}",
            'Normally Distributed?': normal
        })

    return pd.DataFrame(rows)

# Run normality checks on all numeric columns
check_normality(df_age_cleaned)

The normality tests show that none of the key variables follow a normal distribution. All columns—total_deaths_per_million, median_age, aged_65_older, aged_70_older, and life_expectancy—exhibit significant skewness and/or kurtosis, with very low p-values from both the D'Agostino and Jarque-Bera tests, confirming deviations from normality.
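
As a reference point for reading these p-values, the sketch below runs the same two SciPy tests on a synthetic, strongly right-skewed sample (drawn from an exponential distribution, not from our data). In both tests the null hypothesis is that the sample comes from a normal distribution, so a very small p-value means normality is rejected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
skewed = rng.exponential(scale=1.0, size=500)   # strongly right-skewed sample

dagostino = stats.normaltest(skewed)   # combines skewness and kurtosis
jb = stats.jarque_bera(skewed)         # also based on skewness and kurtosis

print(f"skewness: {stats.skew(skewed):.2f}")
print(f"D'Agostino p-value: {dagostino.pvalue:.2e}")
print(f"Jarque-Bera p-value: {jb.pvalue:.2e}")
```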

For visualization purposes, we want to see what the data looks like in histograms. This will help us understand the distribution of the data and identify any potential outliers or skewness.

In [None]:
def visualize_selected_histograms(df):
    """
    Visualizes the distribution of selected numeric columns from df_age_cleaned with histograms.
    """
    selected_cols = [
        'total_deaths_per_million',
        'life_expectancy',
        'median_age',
        'aged_65_older',
        'aged_70_older'
    ]

    n = len(selected_cols)
    n_cols = 3
    n_rows = (n + n_cols - 1) // n_cols 

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4 * n_rows))
    axes = axes.flatten()

    for i, col in enumerate(selected_cols):
        sns.histplot(df[col], kde=True, ax=axes[i])
        axes[i].set_title(f'Distribution of {col}')
        axes[i].set_xlabel(col.replace('_', ' ').title())

    # Hide unused axes if any
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()
    plt.show()

In [None]:
visualize_selected_histograms(df_age_cleaned)

Based on the statistical tests and visualizations, none of the numeric variables is normally distributed. Total deaths per million and the age-related variables (especially % aged 65+ and 70+) are right-skewed, meaning most countries have lower values but a few have very high ones. Life expectancy comes closest to a bell shape, although it still fails the tests, and median age is fairly spread out across countries.

***6.1.3 Outliers***

To identify outliers in the `df_age_cleaned` DataFrame, we can use the Interquartile Range (IQR) method. This involves calculating the first (Q1) and third quartiles (Q3) for each numeric column, then determining the IQR as Q3 - Q1. Outliers are defined as values that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

In [None]:
# Loop through selected columns
for column in ['life_expectancy', 'median_age', 'aged_65_older', 'aged_70_older']:
    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = df_age_cleaned[column].quantile(0.25)
    Q3 = df_age_cleaned[column].quantile(0.75)
    IQR = Q3 - Q1  # Interquartile Range

    # Define the lower and upper bounds for detecting outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Find rows where the value is outside the normal range
    outliers = df_age_cleaned[
        (df_age_cleaned[column] < lower_bound) | 
        (df_age_cleaned[column] > upper_bound)
    ]

    # Print the number of outliers found for the column
    print(f"  {column}: {len(outliers)} outliers detected")

There are no outliers in most variables, except for aged_70_older, which has 1 outlier. This indicates that the data is generally consistent, with only one unusually high or low value in the aged_70_older group. We will keep this outlier, as our dataset is small and it may represent a country with unique characteristics that could be important for our analysis.

To explore how age-related factors impact COVID-19 deaths, we group `life_expectancy`, `aged_65_older`, and `aged_70_older` into “low” and “high” categories in order to compare death rates across these groups. We define "low" and "high" based on the median value of each variable, which splits the countries into two groups of roughly equal size for analysis.

In [None]:
def boxplot_by_age_factors(df):
    """
    Categorizes life_expectancy, aged_65_older, and aged_70_older into
    'Low' and 'High' groups and plots boxplots of total_deaths_per_million for each.
    """
    # Define the factors to categorize
    factors = ['life_expectancy', 'aged_65_older', 'aged_70_older']
    
    # Categorize into 'Low' and 'High' using median split
    for col in factors:
        median_val = df[col].median()
        df[f'{col}_group'] = df[col].apply(lambda x: 'Low' if x < median_val else 'High')

    # Set up subplots
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    for i, col in enumerate(factors):
        sns.boxplot(
            x=f'{col}_group',
            y='total_deaths_per_million',
            data=df,
            ax=axes[i],
            palette='Set2'
        )
        axes[i].set_title(f'Deaths per Million by {col.replace("_", " ").title()} Grouped')
        axes[i].set_xlabel('')
        axes[i].set_ylabel('Deaths per Million')
    
    plt.tight_layout()
    plt.show()

In [None]:
boxplot_by_age_factors(df_age_cleaned)      

# Drop the age group columns after visualization
df_age_cleaned = df_age_cleaned.drop(columns=['life_expectancy_group', 'aged_65_older_group', 'aged_70_older_group']) 
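
The drop step is needed because the plotting function adds the group columns to the DataFrame it receives. One possible alternative is to build the groups on a copy, so the original frame is never changed. The sketch below shows the idea on made-up values; `add_median_groups` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def add_median_groups(df, columns):
    """Return a copy of df with a '<col>_group' column ('Low'/'High') per column.
    Hypothetical helper; the input DataFrame is left unchanged."""
    out = df.copy()
    for col in columns:
        median_val = out[col].median()
        out[f"{col}_group"] = np.where(out[col] < median_val, "Low", "High")
    return out

demo = pd.DataFrame({"life_expectancy": [60.0, 70.0, 80.0, 85.0]})
grouped = add_median_groups(demo, ["life_expectancy"])
print(grouped)
print(list(demo.columns))  # the original frame has no group column
```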

We see that some countries with younger populations or lower life expectancy still experienced high death rates. These outliers suggest that other factors that are not represented in the data (e.g., healthcare quality, policy response, etc.) may also play a significant role.

But overall the boxplots show a clear relationship between the age-related variables and total deaths per million. Countries with larger percentages of older populations (aged 65+ and 70+) and longer life expectancies tend to have higher total deaths per million. This suggests that age and life expectancy are important factors in understanding COVID-19 death rates.

***6.1.4 Correlation***

To assess the relationship between age-related factors and COVID-19 death rates, we will use a Heatmap to visualize the correlation matrix of the numeric variables in the `df_age_cleaned` DataFrame. This will help us identify any strong correlations between the variables, particularly between age-related factors and total deaths per million.

In [None]:
# Function creates a correlation matrix from our DataFrame. 
def my_corr(df):
    cormat = df.drop(columns=['continent', 'location']).corr()  # drops the non-numeric columns, then checks how strongly each pair of columns is related
    return cormat

# function that takes the correlation matrix and draws a heatmap (using Seaborn)
def my_corr_plot(cormat):
    sns.heatmap(cormat, cmap = 'viridis',  annot=True, fmt=".2f", square=True, linewidths=.2) #cmap - sets the color style. annot=true - means the numbers will be shown on the heatmap. 
    plt.show()


my_corr_plot(my_corr(df_age_cleaned))

The heatmap shows a moderate to strong positive correlation between age-related factors and COVID-19 death rates. Median age, percentage aged 65+, and aged 70+ all correlate strongly (~0.66–0.67) with deaths per million, while life expectancy has a moderate correlation (0.52). Median age and life expectancy themselves are highly correlated (0.83), reflecting that countries with older populations tend to have longer life expectancy.

The strong correlation between aged 65+ and aged 70+ is expected since these groups overlap (70+ a subset of 65+), indicating multicollinearity that should be considered in further analysis.
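
One common way to quantify multicollinearity is the variance inflation factor (VIF): each feature is regressed on the others, and VIF = 1 / (1 - R²). The sketch below is a plain NumPy version run on synthetic data, where the `share_65` and `share_70` arrays are made-up stand-ins for two overlapping age variables. As a rule of thumb, values well above 5 to 10 are often read as a warning sign.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2), where R^2 comes
    from regressing that column on all the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic stand-ins: the second column is almost a copy of the first
rng = np.random.default_rng(1)
share_65 = rng.normal(size=500)
share_70 = share_65 + 0.1 * rng.normal(size=500)
unrelated = rng.normal(size=500)

demo_vif = vif(np.column_stack([share_65, share_70, unrelated]))
print(demo_vif)
```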

***6.1.5 Scatter Plots***

We will create scatter plots to visualize the relationships between total deaths per million and the age-related factors: median age, aged 65+ and life expectancy. This will help us understand how these variables relate to COVID-19 death rates.

In [None]:
# Visualise the features and the response using scatterplots
sns.pairplot(df_age_cleaned, x_vars=['life_expectancy', 'median_age', 'aged_65_older'], y_vars='total_deaths_per_million', height=5, aspect=1)

The scatter plots show a positive relationship between total deaths per million and the age-related factors. In particular, countries with higher median ages and larger percentages of people aged 65+ show a clear positive relationship and tend to have higher total deaths per million. Life expectancy also shows a positive relationship, but with more variability.

 #### 6.2 Data Modelling

Now we want to train a model to predict total deaths per million based on the age-related factors. We will use a linear regression model for this purpose, as it is a simple yet effective way to understand the relationship between the dependent variable (total deaths per million) and the independent variables (median age, aged 65+, and life expectancy). Aged 70+ is not included as a feature; as noted in 6.1.4, it overlaps strongly with aged 65+.

***6.2.1 Multiple Linear Regression***

To assess the relationship between total deaths per million and the age-related factors, we will use a multiple linear regression model. This model will allow us to quantify how each independent variable (median age, aged 65+ and life expectancy) contributes to the dependent variable (total deaths per million).

In [None]:
# Create a Python list of feature names
feature_cols = ['life_expectancy', 'median_age', 'aged_65_older']

# Use the list to select a subset of the original DataFrame
X = df_age_cleaned[feature_cols]

# Print the first 5 rows
X.head()

In [None]:
# Select a Series from the DataFrame for y
y = df_age_cleaned['total_deaths_per_million']

# Print the first 5 values
y.head()

In [None]:
# Check the type and shape of X
print(f"Type of X: {type(X)}")
print(f"Shape of X: {X.shape}")

# Check the type and shape of y
print(f"Type of y: {type(y)}")
print(f"Shape of y: {y.shape}")

6.2.1.1 - Now we are going to split X and y variables into training and testing sets. This is important to evaluate the model's performance on unseen data and avoid overfitting.

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Default split 75:25
print(f"Shape of X train: {X_train.shape} and y train: {y_train.shape}")
print(f"Shape of X test: {X_test.shape} and y test: {y_test.shape}")

In [None]:
# Print the first 5 values
print(f"First 5 rows of X train:\n{X_train.head()}")
print(f"First 5 values of y train:\n{y_train.head()}")

6.2.1.2 - We can now create a multiple linear regression model using the training data.

In [None]:
# Create a model
linreg = LinearRegression()

# Fit the model to our training data
linreg.fit(X_train, y_train)

In [None]:
# The intercept and coefficients of the model
print('b0 =', linreg.intercept_)
print('bi =', linreg.coef_)

In [None]:
# Pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))

Looking at the slopes (coefficients) for life expectancy, median age and aged 65+, we see that, holding the other two variables fixed:

- For each additional year of life expectancy, predicted deaths per million decrease by ~10.6.
- For each additional year of median age, predicted deaths per million increase by ~76.3.
- For each additional percentage point of the population aged 65+, predicted deaths per million increase by ~54.2.

This indicates that older populations are associated with higher COVID-19 death rates, while higher life expectancy may slightly reduce them.
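
To show how these slopes combine, the sketch below uses the rounded coefficients quoted above and a hypothetical comparison between two countries. The differences in `delta` are made-up values for illustration.

```python
# Rounded coefficients quoted in the text above
coefs = {"life_expectancy": -10.6, "median_age": 76.3, "aged_65_older": 54.2}

# Hypothetical difference between two countries (made-up values):
# same life expectancy, median age 2 years higher,
# 3 percentage points more of the population aged 65+
delta = {"life_expectancy": 0.0, "median_age": 2.0, "aged_65_older": 3.0}

# In a linear model, the change in the prediction is the sum of
# coefficient * change over all features (the intercept cancels out)
change = sum(coefs[name] * delta[name] for name in coefs)
print(round(change, 1))
```

Under these assumptions the model predicts roughly 315 more deaths per million for the older country.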

6.2.1.3 - Now we test the model with the test data to evaluate its performance. We will use the R-squared value to assess how well the model explains the variance in total deaths per million.

In [None]:
# Make predictions on the testing set
y_predicted = linreg.predict(X_test)

y_predicted

In [None]:
# Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
print(f"Mean Absolute Error (MAE): {metrics.mean_absolute_error(y_test, y_predicted)}")

# Mean Squared Error (MSE) is the mean of the squared errors
print(f"Mean Squared Error (MSE): {metrics.mean_squared_error(y_test, y_predicted)}")

# Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors
print(f"Root Mean Squared Error (RMSE): {np.sqrt(metrics.mean_squared_error(y_test, y_predicted))}")

In [None]:
# R-squared (coefficient of determination)
print(f"R-squared: {r2_score(y_test, y_predicted)}")

In [None]:
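# Visualise the regression results: actual vs predicted values
plt.title('Multiple Linear Regression')
plt.scatter(y_test, y_predicted, color='blue')
plt.xlabel('Actual deaths per million')
plt.ylabel('Predicted deaths per million')
plt.show()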

The model captures a meaningful relationship between age-related factors and COVID-19 death rates, explaining over half the variance in mortality across countries. The positive coefficients for median age and aged 65+ confirm that older populations generally experience higher death rates, while higher life expectancy seems protective or associated with lower deaths.

However, the prediction errors (MAE and RMSE) indicate moderate inaccuracies, so other factors beyond age-related variables likely influence COVID-19 deaths as well. This suggests the model is useful for understanding broad trends but may have limitations in precise predictions due to unaccounted variables or data variability.
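To show what the error metrics measure, here is a minimal sketch that computes MAE, RMSE and R-squared by hand on four made-up values. The numbers are illustrative only and are not results from our model.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 370.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))            # average size of the errors
rmse = np.sqrt(np.mean(errors ** 2))     # penalises large errors more heavily
ss_res = np.sum(errors ** 2)             # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f}, RMSE={rmse:.2f}, R2={r2:.2f}")
```

MAE and RMSE are in the same unit as the target (deaths per million), so they should be judged against the typical size of the target values, while R-squared is unitless.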

***6.2.2 Decision Tree***

To further explore the relationship between age-related factors and COVID-19 death rates, we will use a Decision Tree model. This model will allow us to capture non-linear relationships and interactions between the independent variables and the dependent variable.

But first we have to convert `total_deaths_per_million` into categories, because classification algorithms need a categorical target.
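As a quick illustration of quantile-based binning, the sketch below applies `pd.qcut` to nine made-up values. The series name `toy_death_rate` and its values are assumptions for the example, not part of our dataset.

```python
import pandas as pd

# Nine made-up values; qcut with q=3 puts roughly one third in each bin
rates = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], name="toy_death_rate")
categories = pd.qcut(rates, q=3, labels=["Low", "Medium", "High"])

counts = categories.value_counts()
print(counts)
```

Because the bin edges are quantiles, the three categories are roughly equal in size, which keeps the classes balanced for the classifier.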

In [None]:
# Divide the continuous variable into 3 categories using quantiles
df_age_cleaned['death_rate_category'] = pd.qcut(df_age_cleaned['total_deaths_per_million'], q=3, labels=['Low', 'Medium', 'High'])

In [None]:
# Define the feature columns
feature_cols = ['life_expectancy', 'median_age', 'aged_65_older']

# Select only those columns plus the target column (death_rate_category)
selected_cols = feature_cols + ['death_rate_category']

# Extract the subset DataFrame
df_subset = df_age_cleaned[selected_cols]

# Convert to numpy array
array = df_subset.to_numpy()

# Create two (sub) arrays from it: 
# X - features, all rows, all columns but the last one and y - labels, all rows, the last column
X, y = array[:, :-1], array[:, -1]

# Separate input data into classes based on the death rate category labels
# (the labels are the strings produced by pd.qcut, not integers)
class0 = np.array(X[y == 'High'])
class1 = np.array(X[y == 'Low'])
class2 = np.array(X[y == 'Medium'])

6.2.2.1 - We will split the data into training and testing sets, just like we did for the multiple linear regression model. This is important to evaluate the model's performance on unseen data and avoid overfitting.

In [None]:
# Split the dataset into training and testing sets: 85% training data and 15% test data (test_size=0.15)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.15, random_state=12)

# Build Decision Trees Classifier 
classifier = DecisionTreeClassifier(max_depth=4)
classifier.fit(X_train, y_train)

In [None]:
# Install the graphviz package for DT visualisation
# %pip install graphviz
# import graphviz

# Draw tree from the trained data by graphviz package
# dot_data = tree.export_graphviz(classifier, out_file=None, 
                         # feature_names=feature_cols, class_names = True,        
                         # filled=True, rounded=True, proportion = False,
                         # special_characters=True) 

# Result DT saved in file age.pdf
# graph = graphviz.Source(dot_data)
# graph.render("Data/age") 

# Show the graph
# graph

Not all of us could get the graphviz import to work, even though we installed it correctly. Luckily, one of us got it working, so we rendered the decision tree there and load it below as a PNG file.

In [None]:
from IPython.display import Image, display

# Replace with your image path
display(Image(filename='Data/DecisionTree.png'))

The decision tree model, trained to predict COVID-19 death rate categories using life expectancy, median age, and aged 65+, achieved a balanced 70% test accuracy when limited to a max depth of 4, avoiding overfitting.

Life expectancy was the most influential predictor, consistently used as the root split. The model showed that countries with lower life expectancy and older populations were more likely to have higher death rates.

Overall, the model was both accurate and interpretable, supporting earlier EDA insights and highlighting the importance of demographic factors in explaining COVID-19 mortality differences.
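To give an idea of how a tree chooses a split, here is a minimal sketch of Gini impurity, the default split criterion of scikit-learn's `DecisionTreeClassifier`. The label lists and the candidate split are made up for illustration and do not come from our tree.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity after a split, weighted by the size of each branch."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up node with three balanced classes and one candidate split
parent = ["Low"] * 3 + ["Medium"] * 3 + ["High"] * 3
left = ["High", "High", "High", "Medium"]
right = ["Low", "Low", "Low", "Medium", "Medium"]

print(round(gini(parent), 3), round(weighted_gini(left, right), 3))
```

A split is attractive when the weighted impurity of the two branches is clearly lower than the impurity of the parent node, as it is here.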

6.2.2.2 - We need to validate the model to ensure it is performing well. We will use the accuracy score, precision, recall and F1-score to evaluate the model's performance.

In [None]:
# Set the metrics
scoring = 'accuracy'

# Predict the labels of the test data
y_testp = classifier.predict(X_test)
print(f"Predicted labels:\n{y_testp}\n")
print(f"Observed labels:\n{y_test}")

In [None]:
# Calculate the accuracy of the model by comparing the observed and predicted labels
print("Accuracy is", accuracy_score(y_test, y_testp))

The accuracy score of 0.7 indicates that the model correctly classifies 70% of the test data. This is a reasonable result for a three-class problem with roughly balanced classes, where random guessing would reach about 33%. A precision of 0.7 means that when the model predicts a certain category, it is correct 70% of the time.
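Accuracy is a single overall number, while precision and recall are computed per class. The sketch below derives all three from a made-up 3x3 confusion matrix (rows are true labels, columns are predicted labels); the counts are hypothetical and are not our model's results.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([[9, 1, 0],
               [3, 5, 2],
               [1, 2, 7]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / truly that class

print(f"accuracy={accuracy:.2f}")
print("precision per class:", np.round(precision, 2))
print("recall per class:   ", np.round(recall, 2))
```

In this example the overall accuracy is 0.70, yet the per-class values differ noticeably, which is why we also look at the classification report.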

In [None]:
# Create confusion matrix
confusion_mat = confusion_matrix(y_test,y_testp)
print(f"Confusion matrix:\n{confusion_mat}\n")

# Create cross table
confusion = pd.crosstab(y_test,y_testp)
print(f"Cross table:\n{confusion}")

In [None]:
# Visualize confusion matrix
plt.imshow(confusion_mat, interpolation='nearest', cmap=plt.cm.viridis)
plt.title('Confusion matrix')
plt.colorbar()
ticks = np.arange(3)
plt.xticks(ticks, ticks)
plt.yticks(ticks, ticks)
plt.ylabel('True labels')
plt.xlabel('Predicted labels')
plt.show()

In [None]:
sns.heatmap(confusion_mat, annot=True)

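As we see from both the confusion matrix and cross table, most of the misclassifications involve the "Medium" category, which the classification report below examines in more detail.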

In [None]:
class_names = ['Class0', 'Class1', 'Class2']

# Classifier performance on training dataset
print(f"Training dataset:\n{classification_report(y_train, classifier.predict(X_train), target_names=class_names)}\n")
plt.show()

# Classifier performance on test dataset
print(f"Test dataset:\n{classification_report(y_test, classifier.predict(X_test), target_names=class_names)}")
plt.show()

Based on all the above metrics, the decision tree model performs reasonably well in classifying COVID-19 death rate categories based on age-related factors. There is a 10% drop in performance from training to test, which is reasonably good for a 3-class classification problem. 

- The model performs best at predicting "Low" (Class1) with a perfect recall (1.00) and high f1-score (0.78).
- "High" (Class0) also performs well with a balanced precision (0.80) and recall (0.75).
- "Medium" (Class2) is the weakest, with low recall (0.29) and f1-score (0.36) – it’s often confused with other classes.

The decision tree model shows good overall performance, especially in identifying the "High" and "Low" death rate categories, but struggles with the "Medium" category. This may be because "Medium" overlaps with both of the other groups, so its boundaries are less distinct and harder to classify.
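The mapping of Class0, Class1 and Class2 to "High", "Low" and "Medium" follows from the fact that scikit-learn orders string labels by sorting them, so `target_names` are matched against the alphabetical order. A minimal sketch of that ordering, using `np.unique` on made-up labels:

```python
import numpy as np

# Made-up labels in arbitrary order
labels = np.array(["Low", "Medium", "High", "Low", "High"])

# np.unique returns the distinct labels in sorted (alphabetical) order
sorted_classes = np.unique(labels)
mapping = {f"Class{i}": str(name) for i, name in enumerate(sorted_classes)}

print(mapping)  # {'Class0': 'High', 'Class1': 'Low', 'Class2': 'Medium'}
```

Passing `target_names=['High', 'Low', 'Medium']` to `classification_report` would make the report easier to read than the generic class names.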

#### 6.3 Conclusion of Hypothesis 3

Exploratory Data Analysis revealed clear positive relationships between COVID-19 death rates and demographic indicators like median age and percentage of the population aged 65+. These patterns were consistently supported by scatterplots, correlation heatmaps, and box plots, which showed that countries with older populations tended to fall into higher death rate categories.

The multiple linear regression model reinforced this, indicating that median age and % aged 65+ have strong positive influences on death rates, while life expectancy had a weaker or even slightly negative association. The decision tree classifier further validated these findings, achieving solid predictive accuracy (≈70%) and using life expectancy, median age, and % aged 65+ to effectively distinguish between low, medium, and high death rate categories—particularly excelling at identifying countries at the extremes.

***The hypothesis is supported:*** there is substantial evidence that older population structure is associated with higher COVID-19 mortality. However, since life expectancy showed a weaker or negative link, and some countries with younger populations still had high death rates, age is a key factor but not the sole determinant—other contextual factors (like healthcare capacity or policy response) may also play a role.


---

### 7. Hypothesis 4: Countries with higher prevalence of chronic health conditions (e.g. cardiovascular death rate, diabetes, smoking) have higher COVID-19 death rates

We chose to investigate this hypothesis because chronic conditions like heart disease, diabetes, and smoking are known to increase the risk of severe COVID-19 outcomes. Our goal is to examine if countries with higher rates of these conditions also experienced higher COVID-19 death rates.








 #### 7.1 Explore

In [None]:
df_health_cleaned

In [None]:
df_health_cleaned.info()

In [None]:
df_health_cleaned.describe()

Now that we explored the new cleaned dataframe a bit, we can see that the df_health_cleaned dataframe contains a more manageable number of columns and rows vs the original dataframe. The columns we have retained are relevant for our analysis, and we have removed unnecessary or redundant features.

##### 7.1.1 Check for outliers in the df_health_cleaned

The next step in exploring the data is checking for outlier values that are unusually high or low compared to the rest of the data.
We use the IQR (Interquartile Range) method, which is a common way to detect outliers.

We apply this method to the five features that are relevant to our hypothesis:
- cardiovasc_death_rate
- diabetes_prevalence
- female_smokers
- male_smokers
- total_deaths_per_million

This helps us find any unusual data points that could affect the results of our analysis.
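As a small worked example of the IQR rule, the sketch below applies it to eight made-up values. The numbers are for illustration only and are not taken from `df_health_cleaned`.

```python
import numpy as np

# Made-up values with one obviously extreme point
values = np.array([10, 12, 13, 14, 15, 16, 18, 60], dtype=float)

q1, q3 = np.percentile(values, [25, 75])  # 12.75 and 16.5
iqr = q3 - q1                             # 3.75

lower_bound = q1 - 1.5 * iqr              # 7.125
upper_bound = q3 + 1.5 * iqr              # 22.125

outliers = values[(values < lower_bound) | (values > upper_bound)]
print(outliers)  # [60.]
```

Only values outside the two bounds are flagged, so the rule adapts to the spread of each column instead of using a fixed threshold.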

In [None]:
# Check for outliers in the df_health_cleaned dataframe using IQR method
print("\n..Checking for outliers in the df_health_cleaned dataframe:")

# Loop through selected columns
for column in ['cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers', 'male_smokers', 'total_deaths_per_million']:
    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = df_health_cleaned[column].quantile(0.25)
    Q3 = df_health_cleaned[column].quantile(0.75)
    IQR = Q3 - Q1  # Interquartile Range

    # Define the lower and upper bounds for detecting outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Find rows where the value is outside the normal range
    outliers = df_health_cleaned[
        (df_health_cleaned[column] < lower_bound) | 
        (df_health_cleaned[column] > upper_bound)
    ]

    # Print the number of outliers found for the column
    print(f"  {column}: {len(outliers)} outliers detected")
    print(outliers[['location', column]])

##### 7.1.2 Conclusion of outliers: 
We identified several outliers, especially in diabetes_prevalence (7 outliers). Although these values can affect averages and visualizations, we chose to keep them because they likely reflect real-world differences between countries. Removing them could hide important patterns in how chronic health conditions relate to COVID-19 death rates.

##### 7.1.3 Visualize the correlation between health conditions and Covid-19 death rate

##### 7.1.3.1 Scatterplot

We use a scatterplot to visualize how each health condition is associated with COVID-19 death rates across countries. It helps reveal any visible patterns or trends in the data.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

sns.scatterplot(data=df_health_cleaned, x='cardiovasc_death_rate', y='total_deaths_per_million', ax=axes[0, 0])
axes[0, 0].set_title('Cardiovascular Death Rate vs COVID-19 Deaths per Million')

sns.scatterplot(data=df_health_cleaned, x='diabetes_prevalence', y='total_deaths_per_million', ax=axes[0, 1])
axes[0, 1].set_title('Diabetes Prevalence vs COVID-19 Deaths per Million')

sns.scatterplot(data=df_health_cleaned, x='female_smokers', y='total_deaths_per_million', ax=axes[1, 0])
axes[1, 0].set_title('Female Smokers vs COVID-19 Deaths per Million')

sns.scatterplot(data=df_health_cleaned, x='male_smokers', y='total_deaths_per_million', ax=axes[1, 1])
axes[1, 1].set_title('Male Smokers vs COVID-19 Deaths per Million')

plt.tight_layout()
plt.show()

The above scatterplots suggest only weak or no clear correlations between chronic health conditions and COVID-19 death rates across countries. Cardiovascular death rate and diabetes prevalence show no strong trend, as death rates remain spread across different values. The share of female smokers appears to have a slight positive association with death rates, whereas the share of male smokers shows no clear pattern. Overall, the visualizations indicate that these individual health indicators alone may not strongly explain variations in COVID-19 mortality, and other factors likely contribute more significantly.

##### 7.1.3.2 Correlation matrix

We use a correlation matrix to examine the strength and direction of the relationships between the health-related variables and COVID-19 death rates. It helps us compare all variables at once and identify potential patterns or associations.
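Each cell of the matrix is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. The sketch below computes it by hand on made-up values; `pearson_r` is a hypothetical helper name used only for this example.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()
    y_c = y - y.mean()
    return float(np.sum(x_c * y_c) / np.sqrt(np.sum(x_c ** 2) * np.sum(y_c ** 2)))

# Made-up values for illustration
x = [1, 2, 3, 4, 5]
y_perfect = [2, 4, 6, 8, 10]   # moves exactly with x
y_noisy = [2, 1, 4, 3, 5]      # moves with x, but not perfectly

print(pearson_r(x, y_perfect))  # 1.0
print(pearson_r(x, y_noisy))    # 0.8
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or no linear relationship.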

In [None]:
corr_matrix = df_health_cleaned[['total_deaths_per_million', 'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers', 'male_smokers']].corr()

In [None]:
sns.heatmap(corr_matrix, annot=True, cmap='YlOrRd')

The above matrix shows that female_smokers has the strongest positive correlation with COVID-19 death rates (0.58), while male_smokers shows a weak correlation (0.13), and cardiovasc_death_rate and diabetes_prevalence are slightly negative (-0.14 and -0.065). This suggests that none of the health conditions show a strong linear relationship with COVID-19 deaths, though female smoking stands out with a moderate correlation worth noting.


 #### 7.2 Data Modelling

##### 7.2.1 Multiple Linear regression (Supervised Machine Learning)

To further investigate the relationship between various health factors and COVID-19 death rates per million, we apply multiple linear regression. This approach helps us understand how strongly each factor contributes to differences in mortality rates, and whether these factors can collectively explain the variation in COVID-19 deaths across countries.

In [None]:
# Create a list of the features names
feature_cols = ['cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers', 'male_smokers']

# Select only the relevant predictor variables (independent) from the dataframe
X = df_health_cleaned[feature_cols]

X.head()

In [None]:
# Select the target variable (dependent variable) for prediction
y = df_health_cleaned['total_deaths_per_million']

# Print the first 5 values
y.head()

In [None]:
# Check the type and shape of X
print(type(X))
print(X.shape)

In [None]:
# Check the type and shape of y
print(type(y))
print(y.shape)

In [None]:
# Splitting X and y into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.20)

In [None]:
# The shape of the subsets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
# Creating an instance of Linear Regression model
linreg = LinearRegression()

# Fit the model to our training data
linreg.fit(X_train, y_train)
linreg

In [None]:
# The intercept and coefficients are stored as attributes of the fitted model
print('b0 =', linreg.intercept_)
print('bi =', linreg.coef_)

In [None]:
# Pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))

In [None]:
# Make predictions on the testing set
y_predicted = linreg.predict(X_test)
y_predicted

In [None]:
# Mean Absolute Error (MAE)
print("MAE:", metrics.mean_absolute_error(y_test, y_predicted))

# Mean Squared Error (MSE) 
print("MSE:", metrics.mean_squared_error(y_test, y_predicted))

# Root Mean Squared Error (RMSE)
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_predicted)))

# R-squared
print("R² score:", r2_score(y_test, y_predicted))

In [None]:
# Measure how well the model explains the variation in the data (1 = perfect prediction)
eV = round(metrics.explained_variance_score(y_test, y_predicted), 2)
print('Explained variance score ',eV )

In [None]:
# Visualize the regression results
plt.title('Multiple Linear Regression')
plt.scatter(y_test, y_predicted, color='green', label='Predicted vs Actual')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--', label='Perfect Prediction')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.legend()  # show the labels defined above
plt.show()

The green dots represent individual countries, where the x-axis shows the actual COVID-19 death rates and the y-axis shows the predicted rates from the model. The red dashed line indicates perfect predictions. Since many green dots are far from the line, especially at higher values, it shows that the model struggles to accurately predict the death rates based on the health variables.

##### 7.2.1.1 Conclusion of multiple linear regression


The multiple linear regression model gave an R² score of 0.41, meaning it explains only 41% of the variation in COVID-19 death rates. The RMSE was 929 and MAE was 781, showing that the predicted values differ quite a lot from the actual ones. This suggests that the selected health factors (like diabetes and smoking) are not strong predictors on their own. 
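The notebook reports both the R² score and the explained variance score. The two agree when the prediction errors average out to zero, and differ when the model is systematically too high or too low. A minimal sketch with made-up values, where every prediction is 50 too high:

```python
import numpy as np

# Made-up values: every prediction overshoots by the same amount
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = y_true + 50.0

residuals = y_true - y_pred

# R-squared penalises the constant offset
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance only looks at the spread of the residuals, not their mean
explained_variance = 1 - np.var(residuals) / np.var(y_true)

print(r2, explained_variance)  # 0.8 1.0
```

If the two scores are close for a model, its errors are not dominated by a constant bias.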

 #### 7.3 Conclusion of Hypothesis 4 

The results show only a weak relationship between the health conditions and COVID-19 death rates. The regression model explained just 41% of the variation (R² = 0.41), and the prediction errors (MAE = 781, RMSE = 929) were high. Only female smoking showed a moderate positive correlation.

Although research shows that chronic conditions increase the individual risk of severe illness or death from COVID-19, our hypothesis looked at country-level death rates. At the national level, many other factors — such as healthcare access, vaccination rates, and data reporting — likely affect the overall death rates. This may explain why our model only showed a weak relationship. Therefore, the hypothesis is only partially supported.

---

### 8. Summary and Conclusion of the Project