# Mini Project 3: COVID-19 Data Analysis and Machine Learning

#### Created by Group 7 - Kamilla, Jeanette, Juvena

## Objective

This assignment aims to develop practical skills in data analysis, visualization, and machine learning using real-world COVID-19 data. The project focuses on exploring global pandemic-related indicators to uncover trends, build predictive models, and apply both supervised and unsupervised learning techniques using Python.

Before we begin analyzing the COVID-19 dataset, we need to import a few essential Python libraries that will help us manipulate the data, build models, and visualize our findings:

- **Pandas**: This is a powerful library used to handle and manipulate data in tables (called DataFrames).
- **NumPy**: It helps with numerical operations, especially when we work with arrays or need to do math.
- **Matplotlib** and **Seaborn**: These are popular libraries for creating charts and graphs. We'll use them to spot patterns in the data visually.
- **SciPy (stats module)**: This gives us access to statistical tools, such as tests for whether data is normally distributed.
- **scikit-learn**: This provides the preprocessing tools (scalers), PCA, and evaluation metrics we import below for the machine learning steps.

We'll also configure default styles for our plots to ensure they're clean, visually appealing, and easy to interpret.

In [278]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn import metrics
from scipy.spatial.distance import cdist
from sklearn import preprocessing as prep
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# Set plot styles for better visualization
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")

--------

# 1. Data wrangling and exploration

### Load the Data

Now that we have our tools ready, the next step is to load the COVID-19 dataset into Python so we can start analyzing it.

In this case, we’re working with a single dataset:

- **OWID COVID-19 Latest Data**: a CSV file that contains country-level information on cases, deaths, vaccinations, testing, and various socioeconomic indicators.

We'll use Pandas to read the CSV file and store it as a DataFrame. To keep the code clean and reusable, we wrap the loading step in a small function. This way, we can easily reload or replace the dataset in later steps.

In [283]:
# File paths for the covid datasets. (dataset: last updated 2024-08-04)
dataset_covid = 'Dataset/owid-covid-latest.csv'

# Function to load a CSV file into a DataFrame
def load_csv_to_dataframe(file_path):
    # Reads the CSV file; the first row is used as the column header
    df = pd.read_csv(file_path)
    return df

# Load datasets
print("..Loading COVID-19 dataset")
df_covid = load_csv_to_dataframe(dataset_covid)

..Loading COVID-19 dataset
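
The loader above only reads the file. A minimal sketch of a variant that adds basic sanity checks is shown here; the name `load_csv_checked` and the `required_columns` parameter are illustrative assumptions and not part of the original notebook. A small in-memory CSV stands in for the real file so the sketch runs on its own.

```python
import io

import pandas as pd


def load_csv_checked(source, required_columns=()):
    """Read a CSV and run basic sanity checks (hypothetical helper)."""
    df = pd.read_csv(source)
    # Fail early if the file has a header but no data rows
    if df.empty:
        raise ValueError("The CSV contains no rows")
    # Fail early if a column we rely on later is absent
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


# In-memory stand-in for 'Dataset/owid-covid-latest.csv'
sample_csv = io.StringIO("iso_code,location,total_cases\nAFG,Afghanistan,235214\n")
df_sample = load_csv_checked(sample_csv, required_columns=["iso_code", "location"])
print(df_sample.shape)
```

With the real dataset, the same function could be called with the file path instead of the in-memory buffer.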


### Explore the Data

After loading the dataset, we want to explore it to understand what kind of information it contains and how it's structured.

To do this, we can use several helpful Pandas attributes and methods such as `shape`, `dtypes`, `info()`, `head()`, `tail()`, `sample()`, `describe()` and `isnull().sum()`. Together they show the number of rows and columns, the data type of each column, summary statistics, and the number of missing values per column.

This exploration is crucial as it helps us identify potential issues or areas that need further cleaning or transformation before we proceed with our analysis. 

In [286]:
# Check the shape of the DataFrame (rows, columns)
df_covid.shape

(247, 67)

In [287]:
# Display the data type of each column in the DataFrame
df_covid.dtypes

iso_code                                    object
continent                                   object
location                                    object
last_updated_date                           object
total_cases                                float64
                                            ...   
population                                 float64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 67, dtype: object

In [288]:
# Gives an overview of the DataFrame
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247 entries, 0 to 246
Data columns (total 67 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   iso_code                                    247 non-null    object 
 1   continent                                   235 non-null    object 
 2   location                                    247 non-null    object 
 3   last_updated_date                           247 non-null    object 
 4   total_cases                                 246 non-null    float64
 5   new_cases                                   242 non-null    float64
 6   new_cases_smoothed                          242 non-null    float64
 7   total_deaths                                246 non-null    float64
 8   new_deaths                                  243 non-null    float64
 9   new_deaths_smoothed                         243 non-null    float64
 10  total_cases_pe

In [289]:
# Display the first 5 rows of the DataFrame
df_covid.head()

Unnamed: 0,iso_code,continent,location,last_updated_date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2024-08-04,235214.0,0.0,0.0,7998.0,0.0,0.0,...,,37.746,0.5,64.83,0.511,41128770.0,,,,
1,OWID_AFR,,Africa,2024-08-04,13145380.0,36.0,5.143,259117.0,0.0,0.0,...,,,,,,1426737000.0,,,,
2,ALB,Europe,Albania,2024-08-04,335047.0,0.0,0.0,3605.0,0.0,0.0,...,51.2,,2.89,78.57,0.795,2842318.0,,,,
3,DZA,Africa,Algeria,2024-08-04,272139.0,18.0,2.571,6881.0,0.0,0.0,...,30.4,83.741,1.9,76.88,0.748,44903230.0,,,,
4,ASM,Oceania,American Samoa,2024-08-04,8359.0,0.0,0.0,34.0,0.0,0.0,...,,,,73.74,,44295.0,,,,


In [290]:
# Display the last 5 rows of the DataFrame
df_covid.tail()

Unnamed: 0,iso_code,continent,location,last_updated_date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
242,WLF,Oceania,Wallis and Futuna,2024-08-04,3760.0,0.0,0.0,9.0,0.0,0.0,...,,,,79.94,,11596.0,,,,
243,OWID_WRL,,World,2024-08-14,775866783.0,47169.0,6738.429,7057132.0,815.0,116.429,...,34.635,60.13,2.705,72.58,0.737,7975105000.0,,,,
244,YEM,Asia,Yemen,2024-08-04,11945.0,0.0,0.0,2159.0,0.0,0.0,...,29.2,49.542,0.7,66.12,0.47,33696610.0,,,,
245,ZMB,Africa,Zambia,2024-08-04,349842.0,18.0,2.571,4077.0,0.0,0.0,...,24.7,13.938,2.0,63.89,0.584,20017670.0,,,,
246,ZWE,Africa,Zimbabwe,2024-08-04,266386.0,0.0,0.0,5740.0,0.0,0.0,...,30.7,36.791,1.7,61.49,0.571,16320540.0,,,,


In [291]:
# Display a random sample of 5 rows from the DataFrame
df_covid.sample(5)

Unnamed: 0,iso_code,continent,location,last_updated_date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
234,VIR,North America,United States Virgin Islands,2024-08-04,25389.0,0.0,0.0,132.0,0.0,0.0,...,,,,80.58,,99479.0,,,,
109,JAM,North America,Jamaica,2024-08-04,157181.0,12.0,1.714,3611.0,0.0,0.0,...,28.6,66.425,1.7,74.47,0.734,2827382.0,,,,
152,NRU,Oceania,Nauru,2024-08-04,5393.0,0.0,0.0,1.0,0.0,0.0,...,36.9,,5.0,59.96,,12691.0,,,,
17,BHR,Asia,Bahrain,2024-08-04,696614.0,0.0,0.0,1536.0,0.0,0.0,...,37.6,,2.0,77.29,0.852,1472237.0,,,,
21,BEL,Europe,Belgium,2024-08-04,4872829.0,1277.0,182.429,34339.0,0.0,0.0,...,31.4,,5.64,81.63,0.931,11655923.0,,,,


In [292]:
# Gives summary statistics for all numerical columns in the dataset
df_covid.describe()

Unnamed: 0,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
count,246.0,242.0,242.0,246.0,243.0,243.0,246.0,242.0,242.0,246.0,...,145.0,96.0,173.0,231.0,190.0,247.0,0.0,0.0,0.0,0.0
mean,13366340.0,885.607438,126.515355,119868.9,14.032922,2.00472,203988.255797,22.204909,3.172136,1271.427736,...,32.909897,50.788844,3.097012,73.660866,0.7225,130765600.0,,,,
std,65681300.0,4854.786157,693.540908,574724.0,92.179347,13.16853,200456.90214,82.962646,11.851812,1322.697453,...,13.621757,32.124848,2.555777,7.405725,0.149398,668433300.0,,,,
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.7,1.188,0.1,53.28,0.394,47.0,,,,
25%,27509.5,0.0,0.0,183.75,0.0,0.0,21257.7665,0.0,0.0,144.80825,...,22.6,20.482,1.3,69.545,0.603,429495.5,,,,
50%,232098.5,0.0,0.0,2205.5,0.0,0.0,135384.895,0.0,0.0,877.689,...,33.1,49.6905,2.5,75.05,0.74,5970430.0,,,,
75%,1703974.0,5.5,0.7855,19388.5,0.0,0.0,340625.3,0.232,0.03325,2032.222,...,41.3,82.68675,4.2,79.285,0.82875,28956710.0,,,,
max,775866800.0,47169.0,6738.429,7057132.0,815.0,116.429,763598.6,672.437,96.062,6601.11,...,78.1,100.0,13.8,86.75,0.957,7975105000.0,,,,


##### Summary of exploring the data

After exploring the dataframe, we found that it contains a large number of columns, many of which are not useful for our analysis or modeling goals. While some columns provide valuable information (like total cases, deaths, and vaccination rates), others are either redundant, mostly empty, or irrelevant.

This highlights the need for a thorough data cleaning step to remove unnecessary columns, handle missing values, and focus only on the most relevant features for our machine learning tasks.
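
To make "mostly empty" concrete, one option is to rank columns by their share of missing values. The sketch below uses a small toy frame rather than the real dataset; the helper name `missing_share` and the 0.5 cutoff are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd


def missing_share(df):
    """Return the fraction of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame that mimics the pattern seen above: one fully empty column
toy = pd.DataFrame({
    "total_cases": [235214.0, 335047.0, np.nan, 8359.0],
    "excess_mortality": [np.nan, np.nan, np.nan, np.nan],
    "location": ["Afghanistan", "Albania", "Algeria", "American Samoa"],
})

share = missing_share(toy)
# Columns where more than half of the values are missing (assumed cutoff)
mostly_empty = share[share > 0.5].index.tolist()
print(share)
print(mostly_empty)
```

Applied to `df_covid`, the same helper would list candidates for removal before any columns are dropped.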

### Clean the Data

After loading and exploring the data, we need to clean it to ensure that our analysis is accurate and meaningful. Data cleaning involves several steps, including: checking for missing values, removing duplicates, and converting data types.
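
As a compact illustration of the three steps named above, the following sketch runs them on a toy frame. The toy values are made up for the example, and converting `last_updated_date` to a datetime is only one possible type conversion; it is shown here as an assumption, not as a step the notebook performs.

```python
import pandas as pd

# Toy frame with one duplicated row, one missing value, and a date stored as text
toy = pd.DataFrame({
    "location": ["Albania", "Albania", "Algeria"],
    "last_updated_date": ["2024-08-04", "2024-08-04", "2024-08-04"],
    "total_cases": [335047.0, 335047.0, None],
})

# 1. Count missing values per column
missing_per_column = toy.isna().sum()

# 2. Count and drop fully duplicated rows
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().copy()

# 3. Convert the date column from text to a proper datetime type
deduplicated["last_updated_date"] = pd.to_datetime(
    deduplicated["last_updated_date"], format="%Y-%m-%d"
)

print(missing_per_column)
print(n_duplicates)
print(deduplicated.dtypes)
```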

In [297]:
# Check for missing values in the DataFrame
df_covid.isnull().sum()

iso_code                                     0
continent                                   12
location                                     0
last_updated_date                            0
total_cases                                  1
                                          ... 
population                                   0
excess_mortality_cumulative_absolute       247
excess_mortality_cumulative                247
excess_mortality                           247
excess_mortality_cumulative_per_million    247
Length: 67, dtype: int64

In [298]:
# The output above shows that many columns contain no values at all, so we want to remove them

In [299]:
# Before setting aside the continent-level OWID rows, remove the aggregate rows we do not need (World and the income groups, e.g. high-income countries)
rows_to_remove = ["OWID_UMC", "OWID_WRL", "OWID_LMC", "OWID_LIC", "OWID_HIC"]
df_removed_rows = df_covid[~df_covid["iso_code"].isin(rows_to_remove)]



In [300]:
# Check that the rows above were removed
df_removed_rows.tail()


Unnamed: 0,iso_code,continent,location,last_updated_date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
241,VNM,Asia,Vietnam,2024-08-04,11624000.0,0.0,0.0,43206.0,0.0,0.0,...,45.9,85.847,2.6,75.4,0.704,98186856.0,,,,
242,WLF,Oceania,Wallis and Futuna,2024-08-04,3760.0,0.0,0.0,9.0,0.0,0.0,...,,,,79.94,,11596.0,,,,
244,YEM,Asia,Yemen,2024-08-04,11945.0,0.0,0.0,2159.0,0.0,0.0,...,29.2,49.542,0.7,66.12,0.47,33696612.0,,,,
245,ZMB,Africa,Zambia,2024-08-04,349842.0,18.0,2.571,4077.0,0.0,0.0,...,24.7,13.938,2.0,63.89,0.584,20017670.0,,,,
246,ZWE,Africa,Zimbabwe,2024-08-04,266386.0,0.0,0.0,5740.0,0.0,0.0,...,30.7,36.791,1.7,61.49,0.571,16320539.0,,,,


In [301]:
# Drop all columns that contain no values at all
df_covid = df_removed_rows.dropna(axis=1, how='all')

In [302]:
# Check that the empty columns were removed
df_covid.shape

(242, 52)

In [303]:
# Build a new dataset containing only the columns we consider relevant
columns_we_want_to_keep = [
    "iso_code", "continent", "location", "total_cases", "total_deaths",
    "total_cases_per_million", "total_deaths_per_million",
    "total_vaccinations", "people_vaccinated", "people_fully_vaccinated",
    "total_boosters", "new_vaccinations", "new_vaccinations_smoothed",
    "total_vaccinations_per_hundred", "people_vaccinated_per_hundred",
    "people_fully_vaccinated_per_hundred", "total_boosters_per_hundred",
    "new_vaccinations_smoothed_per_million", "new_people_vaccinated_smoothed",
    "new_people_vaccinated_smoothed_per_hundred", "population_density",
    "median_age", "aged_65_older", "aged_70_older", "cardiovasc_death_rate",
    "diabetes_prevalence", "female_smokers", "male_smokers",
    "life_expectancy", "population"
]

# Keep only the selected columns; all others are dropped
df_covid = df_covid[columns_we_want_to_keep]

In [304]:
# Check that only the selected columns remain
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
Index: 242 entries, 0 to 246
Data columns (total 30 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   iso_code                                    242 non-null    object 
 1   continent                                   235 non-null    object 
 2   location                                    242 non-null    object 
 3   total_cases                                 241 non-null    float64
 4   total_deaths                                241 non-null    float64
 5   total_cases_per_million                     241 non-null    float64
 6   total_deaths_per_million                    241 non-null    float64
 7   total_vaccinations                          13 non-null     float64
 8   people_vaccinated                           11 non-null     float64
 9   people_fully_vaccinated                     11 non-null     float64
 10  total_boosters     

In [305]:
# Inspect the dataset to decide how to proceed
df_covid

Unnamed: 0,iso_code,continent,location,total_cases,total_deaths,total_cases_per_million,total_deaths_per_million,total_vaccinations,people_vaccinated,people_fully_vaccinated,...,population_density,median_age,aged_65_older,aged_70_older,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,life_expectancy,population
0,AFG,Asia,Afghanistan,235214.0,7998.0,5796.468,197.098,,,,...,54.422,18.6,2.581,1.337,597.029,9.59,,,64.83,4.112877e+07
1,OWID_AFR,,Africa,13145380.0,259117.0,9088.877,179.157,,,,...,,,,,,,,,,1.426737e+09
2,ALB,Europe,Albania,335047.0,3605.0,118491.020,1274.926,,,,...,104.871,38.0,13.188,8.643,304.195,10.08,7.1,51.2,78.57,2.842318e+06
3,DZA,Africa,Algeria,272139.0,6881.0,5984.050,151.306,,,,...,17.348,29.1,6.211,3.857,278.364,6.73,0.7,30.4,76.88,4.490323e+07
4,ASM,Oceania,American Samoa,8359.0,34.0,172831.600,702.988,,,,...,278.205,,,,283.750,,,,73.74,4.429500e+04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241,VNM,Asia,Vietnam,11624000.0,43206.0,116612.400,433.444,,,,...,308.127,32.6,7.150,4.718,245.465,6.00,1.0,45.9,75.40,9.818686e+07
242,WLF,Oceania,Wallis and Futuna,3760.0,9.0,326928.100,782.541,,,,...,,,,,,,,,79.94,1.159600e+04
244,YEM,Asia,Yemen,11945.0,2159.0,312.509,56.484,,,,...,53.508,20.3,2.922,1.583,495.003,5.35,7.6,29.2,66.12,3.369661e+07
245,ZMB,Africa,Zambia,349842.0,4077.0,17359.357,202.303,,,,...,22.995,17.7,2.480,1.542,234.499,3.94,3.1,24.7,63.89,2.001767e+07


In [307]:
# Before removing the OWID aggregate rows from the country data, keep the continent-level rows in a separate DataFrame, because they could be relevant later
rows_to_secure = ["OWID_AFR", "OWID_ASI", "OWID_EUR", "OWID_EUN", "OWID_NAM", "OWID_OCE", "OWID_SAM"]
df_continents = df_covid[df_covid["iso_code"].isin(rows_to_secure)]


In [308]:
# Check the continent-level rows
df_continents

Unnamed: 0,iso_code,continent,location,total_cases,total_deaths,total_cases_per_million,total_deaths_per_million,total_vaccinations,people_vaccinated,people_fully_vaccinated,...,population_density,median_age,aged_65_older,aged_70_older,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,life_expectancy,population
1,OWID_AFR,,Africa,13145380.0,259117.0,9088.877,179.157,,,,...,,,,,,,,,,1426737000.0
12,OWID_ASI,,Asia,301499099.0,1637249.0,63948.2,347.262,9104305000.0,3689439000.0,3462095000.0,...,,,,,,,,,,4721383000.0
70,OWID_EUR,,Europe,252916868.0,2102483.0,337990.34,2809.694,1399334000.0,523814300.0,493751300.0,...,,,,,,,,,,744807800.0
71,OWID_EUN,,European Union (27),185822587.0,1262988.0,413754.22,2812.18,951113300.0,338119600.0,327967400.0,...,,,,,,,,,,450146800.0
161,OWID_NAM,,North America,124492666.0,1671178.0,205992.19,2765.22,1158547000.0,458563500.0,394493900.0,...,,,,,,,,,,600323700.0
166,OWID_OCE,,Oceania,15003352.0,32918.0,333039.8,730.704,88358810.0,28960500.0,28072900.0,...,,,,,,,,,,45038860.0
207,OWID_SAM,,South America,68809418.0,1354187.0,159838.72,3145.667,,,,...,,,,,,,,,,436816700.0


In [309]:
# Drop columns that are entirely empty for the continent rows
df_continents_cleaned = df_continents.dropna(axis=1, how='all')
df_continents_cleaned

Unnamed: 0,iso_code,location,total_cases,total_deaths,total_cases_per_million,total_deaths_per_million,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,population
1,OWID_AFR,Africa,13145380.0,259117.0,9088.877,179.157,,,,,,,,,,,,,,1426737000.0
12,OWID_ASI,Asia,301499099.0,1637249.0,63948.2,347.262,9104305000.0,3689439000.0,3462095000.0,1815177000.0,258.0,193.0,192.83,78.14,73.33,38.45,0.0,10.0,0.0,4721383000.0
70,OWID_EUR,Europe,252916868.0,2102483.0,337990.34,2809.694,1399334000.0,523814300.0,493751300.0,365099900.0,64.0,17.0,187.88,70.33,66.29,49.02,0.0,2.0,0.0,744807800.0
71,OWID_EUN,European Union (27),185822587.0,1262988.0,413754.22,2812.18,951113300.0,338119600.0,327967400.0,282438800.0,64.0,17.0,211.29,75.11,72.86,62.74,0.0,2.0,0.0,450146800.0
161,OWID_NAM,North America,124492666.0,1671178.0,205992.19,2765.22,1158547000.0,458563500.0,394493900.0,256264800.0,442.0,442.0,192.99,76.39,65.71,42.69,1.0,0.0,0.0,600323700.0
166,OWID_OCE,Oceania,15003352.0,32918.0,333039.8,730.704,88358810.0,28960500.0,28072900.0,25400950.0,1130.0,1130.0,196.18,64.3,62.33,56.4,25.0,0.0,0.0,45038860.0
207,OWID_SAM,South America,68809418.0,1354187.0,159838.72,3145.667,,,,,,,,,,,,,,436816700.0


In [310]:
df_continents_cleaned.shape

(7, 20)

In [311]:
# Remove columns that are irrelevant for the continent-level data
df_continents_cleaned = df_continents_cleaned.drop(['new_vaccinations_smoothed_per_million', 'new_people_vaccinated_smoothed', 'new_people_vaccinated_smoothed_per_hundred'], axis=1)

df_continents_cleaned

Unnamed: 0,iso_code,location,total_cases,total_deaths,total_cases_per_million,total_deaths_per_million,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,population
1,OWID_AFR,Africa,13145380.0,259117.0,9088.877,179.157,,,,,,,,,,,1426737000.0
12,OWID_ASI,Asia,301499099.0,1637249.0,63948.2,347.262,9104305000.0,3689439000.0,3462095000.0,1815177000.0,258.0,193.0,192.83,78.14,73.33,38.45,4721383000.0
70,OWID_EUR,Europe,252916868.0,2102483.0,337990.34,2809.694,1399334000.0,523814300.0,493751300.0,365099900.0,64.0,17.0,187.88,70.33,66.29,49.02,744807800.0
71,OWID_EUN,European Union (27),185822587.0,1262988.0,413754.22,2812.18,951113300.0,338119600.0,327967400.0,282438800.0,64.0,17.0,211.29,75.11,72.86,62.74,450146800.0
161,OWID_NAM,North America,124492666.0,1671178.0,205992.19,2765.22,1158547000.0,458563500.0,394493900.0,256264800.0,442.0,442.0,192.99,76.39,65.71,42.69,600323700.0
166,OWID_OCE,Oceania,15003352.0,32918.0,333039.8,730.704,88358810.0,28960500.0,28072900.0,25400950.0,1130.0,1130.0,196.18,64.3,62.33,56.4,45038860.0
207,OWID_SAM,South America,68809418.0,1354187.0,159838.72,3145.667,,,,,,,,,,,436816700.0


In [312]:
df_continents_cleaned.shape

(7, 17)

In [313]:
df_continents_cleaned.duplicated().sum()

0

In [314]:
# Remove the continent-level OWID rows, which do not belong in the country-level data
rows_to_remove = ["OWID_AFR", "OWID_ASI", "OWID_EUR", "OWID_EUN", "OWID_NAM", "OWID_OCE", "OWID_SAM"]
df_covid_removed_rows = df_covid[~df_covid['iso_code'].isin(rows_to_remove)]
df_covid_removed_rows
                                

Unnamed: 0,iso_code,continent,location,total_cases,total_deaths,total_cases_per_million,total_deaths_per_million,total_vaccinations,people_vaccinated,people_fully_vaccinated,...,population_density,median_age,aged_65_older,aged_70_older,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,life_expectancy,population
0,AFG,Asia,Afghanistan,235214.0,7998.0,5796.468,197.098,,,,...,54.422,18.6,2.581,1.337,597.029,9.59,,,64.83,41128772.0
2,ALB,Europe,Albania,335047.0,3605.0,118491.020,1274.926,,,,...,104.871,38.0,13.188,8.643,304.195,10.08,7.1,51.2,78.57,2842318.0
3,DZA,Africa,Algeria,272139.0,6881.0,5984.050,151.306,,,,...,17.348,29.1,6.211,3.857,278.364,6.73,0.7,30.4,76.88,44903228.0
4,ASM,Oceania,American Samoa,8359.0,34.0,172831.600,702.988,,,,...,278.205,,,,283.750,,,,73.74,44295.0
5,AND,Europe,Andorra,48015.0,159.0,602280.440,1994.431,,,,...,163.755,,,,109.135,7.97,29.0,37.8,83.73,79843.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241,VNM,Asia,Vietnam,11624000.0,43206.0,116612.400,433.444,,,,...,308.127,32.6,7.150,4.718,245.465,6.00,1.0,45.9,75.40,98186856.0
242,WLF,Oceania,Wallis and Futuna,3760.0,9.0,326928.100,782.541,,,,...,,,,,,,,,79.94,11596.0
244,YEM,Asia,Yemen,11945.0,2159.0,312.509,56.484,,,,...,53.508,20.3,2.922,1.583,495.003,5.35,7.6,29.2,66.12,33696612.0
245,ZMB,Africa,Zambia,349842.0,4077.0,17359.357,202.303,,,,...,22.995,17.7,2.480,1.542,234.499,3.94,3.1,24.7,63.89,20017670.0
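
In this snapshot, the outputs above suggest that the rows with a missing `continent` are exactly the OWID aggregate rows: 12 values are missing in the full data, and the two removal lists contain 5 and 7 codes. Under the assumption that this also holds for other snapshots, the split into aggregates and countries could be sketched without hard-coding the codes. The toy frame and the names `df_aggregates` and `df_countries` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame: aggregate rows have no continent, country rows do
toy = pd.DataFrame({
    "iso_code": ["AFG", "OWID_AFR", "ALB", "OWID_WRL"],
    "continent": ["Asia", np.nan, "Europe", np.nan],
    "location": ["Afghanistan", "Africa", "Albania", "World"],
})

# Assumption: a missing continent marks an aggregate row
is_aggregate = toy["continent"].isna()
df_aggregates = toy[is_aggregate]
df_countries = toy[~is_aggregate]

print(df_aggregates["iso_code"].tolist())
print(df_countries["iso_code"].tolist())
```

The explicit code lists used in the notebook remain the safer choice when only some aggregates should be kept, as is the case here with the continent rows.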


In [350]:
# Remove the vaccination columns, which are mostly empty at the country level
df_covid_cleaned = df_covid_removed_rows.drop([
    "total_vaccinations", "people_vaccinated", "people_fully_vaccinated",
    "total_boosters", "new_vaccinations", "new_vaccinations_smoothed",
    "total_vaccinations_per_hundred", "people_vaccinated_per_hundred",
    "people_fully_vaccinated_per_hundred", "total_boosters_per_hundred",
    "new_vaccinations_smoothed_per_million", "new_people_vaccinated_smoothed",
    "new_people_vaccinated_smoothed_per_hundred"], axis=1)
df_covid_cleaned

Unnamed: 0,iso_code,continent,location,total_cases,total_deaths,total_cases_per_million,total_deaths_per_million,population_density,median_age,aged_65_older,aged_70_older,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,life_expectancy,population
0,AFG,Asia,Afghanistan,235214.0,7998.0,5796.468,197.098,54.422,18.6,2.581,1.337,597.029,9.59,,,64.83,41128772.0
2,ALB,Europe,Albania,335047.0,3605.0,118491.020,1274.926,104.871,38.0,13.188,8.643,304.195,10.08,7.1,51.2,78.57,2842318.0
3,DZA,Africa,Algeria,272139.0,6881.0,5984.050,151.306,17.348,29.1,6.211,3.857,278.364,6.73,0.7,30.4,76.88,44903228.0
4,ASM,Oceania,American Samoa,8359.0,34.0,172831.600,702.988,278.205,,,,283.750,,,,73.74,44295.0
5,AND,Europe,Andorra,48015.0,159.0,602280.440,1994.431,163.755,,,,109.135,7.97,29.0,37.8,83.73,79843.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241,VNM,Asia,Vietnam,11624000.0,43206.0,116612.400,433.444,308.127,32.6,7.150,4.718,245.465,6.00,1.0,45.9,75.40,98186856.0
242,WLF,Oceania,Wallis and Futuna,3760.0,9.0,326928.100,782.541,,,,,,,,,79.94,11596.0
244,YEM,Asia,Yemen,11945.0,2159.0,312.509,56.484,53.508,20.3,2.922,1.583,495.003,5.35,7.6,29.2,66.12,33696612.0
245,ZMB,Africa,Zambia,349842.0,4077.0,17359.357,202.303,22.995,17.7,2.480,1.542,234.499,3.94,3.1,24.7,63.89,20017670.0


In [352]:
# Check for duplicate rows
df_covid_cleaned.duplicated().sum()

0
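
The cleaned frame still contains missing values in some indicator columns (for example `median_age` for American Samoa). One simple way to handle them before modeling is median imputation, sketched below on a toy frame. This is an assumed option shown for illustration, not the approach taken in the notebook; dropping incomplete rows would be another.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing numeric value
toy = pd.DataFrame({
    "location": ["Afghanistan", "Albania", "American Samoa"],
    "median_age": [18.6, 38.0, np.nan],
    "life_expectancy": [64.83, 78.57, 73.74],
})

# Fill each numeric column's gaps with that column's median
numeric_cols = toy.select_dtypes(include="number").columns
imputed = toy.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

print(imputed)
```

The median is less sensitive to extreme values than the mean, which matters for skewed indicators such as population density.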

# 2. Supervised machine learning: linear regression

# 3. Supervised machine learning: classification

# 4. Unsupervised machine learning: clustering

# 5. Implementation of the models in a Streamlit application