# Final Assignment
The following script loads data on COVID-19-related deaths grouped by vaccination status and age, then briefly visualizes and analyzes potential differences in death rates across vaccination statuses and age groups.
The data was obtained from the Government of Ontario Data Catalogue; the link to the CSV file can be found in README.md.

## Importing libraries

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.cbook as cbook
import seaborn as sns

## Importing and understanding the data set
First, we import the CSV file using pandas and explore its contents.

In [4]:
# Load the dataset
dataset = pd.read_csv('2aa6e2ce-40de-4910-a737-81762e014b0b.csv')

# Display the first few rows of the dataset
dataset.head()

,_id,date,age_group,deaths_boost_vac_rate_7ma,deaths_full_vac_rate_7ma,deaths_not_full_vac_rate_7ma
0,8858,2021-03-01T00:00:00,0-4yrs,0.0,0.0,0.0
1,8859,2021-03-01T00:00:00,5-11yrs,0.0,0.0,0.0
2,8860,2021-03-01T00:00:00,12-17yrs,0.0,0.0,0.0
3,8861,2021-03-01T00:00:00,18-39yrs,0.0,0.0,0.0
4,8862,2021-03-01T00:00:00,40-59yrs,0.0,0.0,0.02


In [6]:
# Display the length of the dataset
len(dataset)

8892

In [7]:
# Display general information about the dataset and check for null values
print("\nGeneral info about the dataset:")
print(dataset.info())


General info about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8892 entries, 0 to 8891
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   _id                           8892 non-null   int64  
 1   date                          8892 non-null   object 
 2   age_group                     8892 non-null   object 
 3   deaths_boost_vac_rate_7ma     8892 non-null   float64
 4   deaths_full_vac_rate_7ma      8892 non-null   float64
 5   deaths_not_full_vac_rate_7ma  8892 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 416.9+ KB
None


It appears that no null values are present in the data. We can double-check that with:

In [8]:
# Check for missing values in the dataset
print("\nMissing values in the dataset:")
print(dataset.isnull().sum())


Missing values in the dataset:
_id                             0
date                            0
age_group                       0
deaths_boost_vac_rate_7ma       0
deaths_full_vac_rate_7ma        0
deaths_not_full_vac_rate_7ma    0
dtype: int64


Now we check the data types, which will help us understand how to use the values stored in each column.

In [10]:
# Check the column names and data types
print("\nColumn names and data types:")
print(dataset.dtypes)


Column names and data types:
_id                               int64
date                             object
age_group                        object
deaths_boost_vac_rate_7ma       float64
deaths_full_vac_rate_7ma        float64
deaths_not_full_vac_rate_7ma    float64
dtype: object


Now we convert the date column to a datetime type.

In [13]:
# Convert the 'date' column to datetime
dataset['date'] = pd.to_datetime(dataset['date'])
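To illustrate what the conversion makes possible, here is a minimal sketch on a small hypothetical frame (`sample` is not part of the notebook; its strings only mimic the format shown by `dataset.head()`). Once the column holds datetime values, the `.dt` accessor exposes components such as year and month.

```python
import pandas as pd

# Hypothetical sample mirroring the 'date' string format seen in dataset.head()
sample = pd.DataFrame({"date": ["2021-03-01T00:00:00", "2021-03-02T00:00:00"]})

# Parse the ISO 8601 strings into datetime values
sample["date"] = pd.to_datetime(sample["date"])

# With a datetime dtype, the .dt accessor exposes date components
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month

print(sample.dtypes)
```

The same `.dt` accessor should work on `dataset['date']` after the conversion above, which is useful later for grouping or plotting by time period.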

Summary statistics are helpful in understanding the central tendency and dispersion of the data in each column.

In [9]:
# Summary statistics for numerical columns
print("\nSummary statistics for numerical columns:")
print(dataset.describe())


Summary statistics for numerical columns:
               _id  deaths_boost_vac_rate_7ma  deaths_full_vac_rate_7ma  \
count   8892.00000                8892.000000               8892.000000   
mean   13303.50000                   0.024245                  0.023695   
std     2567.04363                   0.067461                  0.109072   
min     8858.00000                   0.000000                  0.000000   
25%    11080.75000                   0.000000                  0.000000   
50%    13303.50000                   0.000000                  0.000000   
75%    15526.25000                   0.010000                  0.000000   
max    17749.00000                   0.810000                  1.970000   

       deaths_not_full_vac_rate_7ma  
count                   8892.000000  
mean                       0.261889  
std                        1.218294  
min                        0.000000  
25%                        0.000000  
50%                        0.000000  
75%            

We can tell right away that the mean death rate of the not-fully-vaccinated group (about 0.26) is considerably higher than those of the fully vaccinated and boosted groups (about 0.024 each). We will explore this further below.
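As a rough sketch of how this comparison could be made explicit, the snippet below computes the column means, their ratio, and a per-age-group breakdown. It runs on a small hypothetical frame named `toy` whose values are illustrative only and are not taken from the real data; the column names match the dataset. Note that a mean over all rows mixes age groups together, so the per-age-group table is likely the more informative view.

```python
import pandas as pd

# Rate columns as they appear in the dataset
rate_cols = [
    "deaths_boost_vac_rate_7ma",
    "deaths_full_vac_rate_7ma",
    "deaths_not_full_vac_rate_7ma",
]

# Hypothetical values for illustration only (NOT real data)
toy = pd.DataFrame({
    "age_group": ["18-39yrs", "40-59yrs", "18-39yrs", "40-59yrs"],
    "deaths_boost_vac_rate_7ma": [0.00, 0.02, 0.00, 0.02],
    "deaths_full_vac_rate_7ma": [0.00, 0.02, 0.00, 0.02],
    "deaths_not_full_vac_rate_7ma": [0.00, 0.20, 0.00, 0.20],
})

# Overall mean of each rate column
means = toy[rate_cols].mean()

# How many times higher the not-fully-vaccinated mean is than the fully vaccinated mean
ratio = means["deaths_not_full_vac_rate_7ma"] / means["deaths_full_vac_rate_7ma"]

# Mean rates broken down by age group
by_age = toy.groupby("age_group")[rate_cols].mean()

print(means)
print(f"Ratio (not fully vaccinated / fully vaccinated): {ratio:.1f}")
print(by_age)
```

Replacing `toy` with `dataset` would apply the same steps to the notebook's data.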

## Data Visualization
Exploring trends and relationships in the data through visualization.