In [5]:
import plotly.io as pio

pio.renderers.default = "vscode+jupyterlab+notebook_connected"

# **Impacts of achieving full COVID-19 vaccination status**

## **Project Overview**
This project uses a dataset describing how COVID-19 vaccination affects health outcomes. It analyzes the impact of being fully vaccinated on the COVID-19 outcomes listed below, where fully vaccinated is defined as: "Completion of primary series of a U.S. Food and Drug Administration (FDA)-authorized or approved COVID-19 vaccine at least 14 days prior to a positive test (with no other positive tests in the previous 45 days)".
Outcomes: 

• Cases: People with a positive molecular (PCR) or antigen COVID-19 test result from an FDA-authorized COVID-19 test that was reported into I-NEDSS. A person can become re-infected with SARS-CoV-2 over time and so may be counted more than once in this dataset. Cases are counted by week the test specimen was collected.

• Hospitalizations: COVID-19 cases who are hospitalized due to a documented COVID-19 related illness or who are admitted for any reason within 14 days of a positive SARS-CoV-2 test. Hospitalizations are counted by week of hospital admission.

• Deaths: COVID-19 cases who died from COVID-19-related health complications as determined by vital records or a public health investigation. Deaths are counted by week of death.

Citation: [Data GOV | City of Chicago](https://catalog.data.gov/dataset/covid-19-outcomes-by-vaccination-status#).

## **1. Using pandas**
### **1. Read in the data.**
Load the file 'COVID-19_Outcomes_by_Vaccination_Status_-_Historical.csv' into a pandas DataFrame called covid_19.

In [None]:
import pandas as pd

covid_19 = pd.read_csv('COVID-19_Outcomes_by_Vaccination_Status_-_Historical.csv', encoding='ISO-8859-1')

### **2. Data Cleaning.**
For numeric processing, replace the placeholder for missing values ("..") with NaN.

In [None]:
import pandas as pd
import numpy as np

covid_19 = pd.read_csv('COVID-19_Outcomes_by_Vaccination_Status_-_Historical.csv', encoding='ISO-8859-1')
covid_19.replace("..", np.nan, inplace=True)
covid_19.head(30)

Unnamed: 0,Outcome,Week End,Age Group,Unvaccinated Rate,Vaccinated Rate,Boosted Rate,Crude Vaccinated Ratio,Crude Boosted Ratio,Age-Adjusted Unvaccinated Rate,Age-Adjusted Vaccinated Rate,...,Age-Adjusted Vaccinated Ratio,Age-Adjusted Boosted Ratio,Population Unvaccinated,Population Vaccinated,Population Boosted,Outcome Unvaccinated,Outcome Vaccinated,Outcome Boosted,Age Group Min,Age Group Max
0,Deaths,07/09/2022,0-4,0.0,,,,,,,...,,,162642,,,0,,,0,4
1,Cases,11/12/2022,0-4,82.4,5.5,,15.0,,,,...,,,162642,,,134,9.0,,0,4
2,Cases,02/26/2022,0-4,54.1,,,,,,,...,,,162642,,,88,,,0,4
3,Hospitalizations,12/11/2021,0-4,3.1,,,,,,,...,,,162642,,,5,,,0,4
4,Cases,11/20/2021,0-4,104.5,,,,,,,...,,,162642,,,170,,,0,4
5,Cases,10/23/2021,5-11,111.7,,,,,,,...,,,210448,,,235,,,5,11
6,Deaths,07/23/2022,0-4,0.0,0.0,,,,,,...,,,162642,,,0,0.0,,0,4
7,Deaths,07/16/2022,0-4,0.0,,,,,,,...,,,162642,,,0,,,0,4
8,Cases,04/03/2021,0-4,73.8,,,,,,,...,,,162642,,,120,,,0,4
9,Deaths,07/02/2022,0-4,0.0,,,,,,,...,,,162642,,,0,,,0,4


### **3. Compute the mean, the median and the mode.**
3-1) Review the column names and examine the initial rows to understand the data structure.

In [None]:
covid_19.columns

Index(['Outcome', 'Week End', 'Age Group', 'Unvaccinated Rate',
       'Vaccinated Rate', 'Boosted Rate', 'Crude Vaccinated Ratio',
       'Crude Boosted Ratio', 'Age-Adjusted Unvaccinated Rate',
       'Age-Adjusted Vaccinated Rate', 'Age-Adjusted Boosted Rate',
       'Age-Adjusted Vaccinated Ratio', 'Age-Adjusted Boosted Ratio',
       'Population Unvaccinated', 'Population Vaccinated',
       'Population Boosted', 'Outcome Unvaccinated', 'Outcome Vaccinated',
       'Outcome Boosted', 'Age Group Min', 'Age Group Max'],
      dtype='object')

In [None]:
covid_19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3753 entries, 0 to 3752
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Outcome                         3753 non-null   object 
 1   Week End                        3753 non-null   object 
 2   Age Group                       3753 non-null   object 
 3   Unvaccinated Rate               3753 non-null   float64
 4   Vaccinated Rate                 3426 non-null   float64
 5   Boosted Rate                    2529 non-null   float64
 6   Crude Vaccinated Ratio          2320 non-null   float64
 7   Crude Boosted Ratio             1761 non-null   float64
 8   Age-Adjusted Unvaccinated Rate  417 non-null    float64
 9   Age-Adjusted Vaccinated Rate    417 non-null    float64
 10  Age-Adjusted Boosted Rate       330 non-null    float64
 11  Age-Adjusted Vaccinated Ratio   380 non-null    float64
 12  Age-Adjusted Boosted Ratio      31

3-2) Select the data.

In [None]:
covid_19[['Outcome', 'Age Group', 'Vaccinated Rate', 'Population Vaccinated']]

Unnamed: 0,Outcome,Age Group,Vaccinated Rate,Population Vaccinated
0,Deaths,0-4,,
1,Cases,0-4,5.5,
2,Cases,0-4,,
3,Hospitalizations,0-4,,
4,Cases,0-4,,
...,...,...,...,...
3748,Deaths,65-79,5.5,146087.0
3749,Cases,18-29,17.7,214130.0
3750,Deaths,80+,3.8,52266.0
3751,Deaths,80+,1.9,52848.0


3-3) Remove rows with missing data in the 'Population Vaccinated' column.

In [None]:
filtered_covid_19 = covid_19[['Outcome', 'Age Group', 'Population Vaccinated']].dropna(subset=['Population Vaccinated'])
filtered_covid_19

Unnamed: 0,Outcome,Age Group,Population Vaccinated
10,Hospitalizations,80+,52936.0
11,Deaths,All,842535.0
12,Deaths,80+,52117.0
17,Deaths,80+,40458.0
18,Hospitalizations,80+,53008.0
...,...,...,...
3748,Deaths,65-79,146087.0
3749,Cases,18-29,214130.0
3750,Deaths,80+,52266.0
3751,Deaths,80+,52848.0


3-4) Sort by the 'Age Group' and 'Outcome' columns.

In [None]:
sorted_covid_19 = filtered_covid_19.sort_values(by=['Age Group', 'Outcome'], ascending=[True, False])
sorted_covid_19

Unnamed: 0,Outcome,Age Group,Population Vaccinated
40,Hospitalizations,12-17,87028.0
93,Hospitalizations,12-17,86108.0
441,Hospitalizations,12-17,95178.0
471,Hospitalizations,12-17,86264.0
517,Hospitalizations,12-17,95783.0
...,...,...,...
3646,Cases,All,1496643.0
3653,Cases,All,1437082.0
3656,Cases,All,878533.0
3693,Cases,All,396727.0


3-5) Sum 'Population Vaccinated' by age group and outcome, and display the result as a pivot-style table.

In [None]:
age_group_outcomes = filtered_covid_19.groupby(['Age Group', 'Outcome']).sum().unstack(fill_value=0)

print(age_group_outcomes)


          Population Vaccinated                              
Outcome                   Cases       Deaths Hospitalizations
Age Group                                                    
12-17                11362972.0   11190060.0       11276529.0
18-29                31525118.0   31093965.0       31309490.0
30-49                45662142.0   45103339.0       45382724.0
5-11                  8363078.0    8199577.0        8281393.0
50-64                23831625.0   23580296.0       23705992.0
65-79                11096345.0   11007788.0       11052036.0
80+                   2989564.0    2966268.0        2977906.0
All                 135897681.0  134172318.0      135034976.0
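The row order above is lexicographic, which is why '5-11' appears between '30-49' and '50-64'. The following is a minimal sketch of one way to order the labels by their lower age bound instead. The helper name `age_group_sort_key` is hypothetical and not part of the notebook; the dataset's 'Age Group Min' column could presumably serve the same purpose as a sort key in pandas.

```python
def age_group_sort_key(label):
    """Sort key: numeric lower bound first, with the aggregate 'All' row last.

    Hypothetical helper for illustration; assumes labels look like
    '5-11', '80+', or 'All', as in the table above.
    """
    if label == "All":
        return (1, 0)
    lower_bound = label.rstrip("+").split("-")[0]
    return (0, int(lower_bound))


# Labels in the lexicographic order printed by the pivot-style table.
labels = ["12-17", "18-29", "30-49", "5-11", "50-64", "65-79", "80+", "All"]

ordered = sorted(labels, key=age_group_sort_key)
print(ordered)
```

With this key, '5-11' moves to the front and 'All' stays at the end.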


3-6) Check the column names and display the result.

In [None]:
print(age_group_outcomes.columns)

MultiIndex([('Population Vaccinated',            'Cases'),
            ('Population Vaccinated',           'Deaths'),
            ('Population Vaccinated', 'Hospitalizations')],
           names=[None, 'Outcome'])


In [None]:
age_group_outcomes[('Population Vaccinated', 'Deaths')]

Age Group
12-17     11190060.0
18-29     31093965.0
30-49     45103339.0
5-11       8199577.0
50-64     23580296.0
65-79     11007788.0
80+        2966268.0
All      134172318.0
Name: (Population Vaccinated, Deaths), dtype: float64

3-7) Calculate the mean, median, and mode of Deaths across Age Groups.

In [None]:
mean_value = age_group_outcomes[('Population Vaccinated','Deaths')].dropna().astype(float).mean()
print("Mean:", mean_value)

median_value = age_group_outcomes[('Population Vaccinated','Deaths')].dropna().astype(float).median()
print("Median:", median_value)

mode_value = age_group_outcomes[('Population Vaccinated','Deaths')].dropna().astype(float).mode()[0]
print("Mode:", mode_value)

Mean: 33414201.375
Median: 17385178.0
Mode: 2966268.0
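The 'All' row is an aggregate rather than an age group, so including it pulls the mean upward. The sketch below recomputes the mean and median without it. It uses the totals printed above in plain Python so that it runs without the CSV file, and the variable names are illustrative only. In the notebook itself, dropping the 'All' label from the Series before calling `.mean()` and `.median()` would presumably give the same result.

```python
from statistics import mean, median

# Totals copied from the ('Population Vaccinated', 'Deaths') column above.
deaths_row_totals = {
    "12-17": 11190060,
    "18-29": 31093965,
    "30-49": 45103339,
    "5-11": 8199577,
    "50-64": 23580296,
    "65-79": 11007788,
    "80+": 2966268,
    "All": 134172318,
}

# Keep only the real age groups; 'All' is an aggregate row.
age_groups_only = [
    total for group, total in deaths_row_totals.items() if group != "All"
]

mean_without_all = mean(age_groups_only)
median_without_all = median(age_groups_only)

print("Mean without 'All':", mean_without_all)
print("Median without 'All':", median_without_all)
```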


## **2. Using Python standard library**
### **1. Read in the data**
1-a) Import the CSV file and build a dictionary of totals per age group, using a function to provide default values. Skip rows with missing data in the 'Population Vaccinated' column, and sum the remaining values by outcome category.

In [None]:
import csv
from collections import defaultdict
from statistics import mean, median, mode

In [None]:
def default_age_group_data():
    return {'Deaths': 0, 'Cases': 0, 'Population Vaccinated': 0}

age_group_data = defaultdict(default_age_group_data)

with open('COVID-19_Outcomes_by_Vaccination_Status_-_Historical.csv', mode='r', encoding='ISO-8859-1') as file:
    reader = csv.DictReader(file)
    for row in reader:
        outcome = row['Outcome']
        age_group = row['Age Group']
        population_vaccinated = row['Population Vaccinated']

        if population_vaccinated and population_vaccinated not in ("..", "NaN", "NA"):
            try:
                population_vaccinated = int(population_vaccinated)
            except ValueError:
                continue 

            if outcome == 'Deaths':
                age_group_data[age_group]['Deaths'] += population_vaccinated 
            elif outcome == 'Cases':
                age_group_data[age_group]['Cases'] += population_vaccinated
            age_group_data[age_group]['Population Vaccinated'] += population_vaccinated

print(age_group_data)

defaultdict(<function default_age_group_data at 0x13c03ca40>, {'80+': {'Deaths': 2966268, 'Cases': 2989564, 'Population Vaccinated': 8933738}, 'All': {'Deaths': 134172318, 'Cases': 135897681, 'Population Vaccinated': 405104975}, '50-64': {'Deaths': 23580296, 'Cases': 23831625, 'Population Vaccinated': 71117913}, '5-11': {'Deaths': 8199577, 'Cases': 8363078, 'Population Vaccinated': 24844048}, '18-29': {'Deaths': 31093965, 'Cases': 31525118, 'Population Vaccinated': 93928573}, '12-17': {'Deaths': 11190060, 'Cases': 11362972, 'Population Vaccinated': 33829561}, '30-49': {'Deaths': 45103339, 'Cases': 45662142, 'Population Vaccinated': 136148205}, '65-79': {'Deaths': 11007788, 'Cases': 11096345, 'Population Vaccinated': 33156169}})
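The same aggregation pattern can be tried on a small in-memory sample, which makes the handling of missing values easier to follow. This is a sketch under stated assumptions: the sample rows reuse a few values shown in the tables earlier, the `MISSING` set and the nested `defaultdict` layout are illustrative choices, and none of this replaces the cell above.

```python
import csv
import io
from collections import defaultdict

# Small stand-in for the CSV file. The blank field mimics a missing value.
sample_csv = """Outcome,Age Group,Population Vaccinated
Deaths,0-4,
Hospitalizations,80+,52936
Deaths,80+,52117
Deaths,80+,40458
Cases,18-29,214130
"""

# Placeholders treated as missing (an assumption for this sketch).
MISSING = {"", "..", "NaN", "NA"}

# totals[age_group][outcome] -> summed 'Population Vaccinated'
totals = defaultdict(lambda: defaultdict(int))

for row in csv.DictReader(io.StringIO(sample_csv)):
    raw = row["Population Vaccinated"].strip()
    if raw in MISSING:
        continue  # skip rows without a usable value
    try:
        value = int(float(raw))  # also accepts strings such as "52936.0"
    except ValueError:
        continue
    totals[row["Age Group"]][row["Outcome"]] += value

print({group: dict(by_outcome) for group, by_outcome in totals.items()})
```

A nested `defaultdict(int)` keeps every outcome, including Hospitalizations, under its own key without listing the outcomes in advance.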


1-b) Create a list of the Deaths totals for each age group and output the mean, median, and mode.

In [None]:
age_group_outcomes = {}
for age_group, totals in age_group_data.items():
    # Store the Deaths total so the table below has rows to print.
    age_group_outcomes[age_group] = {'Deaths': totals['Deaths']}

print("Age Group Outcomes:")
print("Age Group  | Deaths")
print("-------------------------------")
for age_group, outcomes in sorted(age_group_outcomes.items()):
    print(f"{age_group:<10} | {outcomes['Deaths']}")

# One Deaths total per age group
deaths = [outcomes['Deaths'] for outcomes in age_group_data.values()]

# Mean
mean_value = sum(deaths) / len(deaths)
print("Mean:", mean_value)

# Median: average the two middle values when the count is even
sorted_deaths = sorted(deaths)
n = len(sorted_deaths)
if n % 2 == 0:
    median_value = (sorted_deaths[n // 2 - 1] + sorted_deaths[n // 2]) / 2
else:
    median_value = sorted_deaths[n // 2]
print("Median:", median_value)

# Mode (display the smallest mode)
frequency = {}
for death in deaths:
    if death in frequency:
        frequency[death] += 1
    else:
        frequency[death] = 1

max_frequency = max(frequency.values())
mode_value = [key for key, val in frequency.items() if val == max_frequency]
mode_value = float(min(mode_value))
print("Mode:", mode_value)

Age Group Outcomes:
Age Group  | Deaths
-------------------------------
12-17      | 11190060
18-29      | 31093965
30-49      | 45103339
5-11       | 8199577
50-64      | 23580296
65-79      | 11007788
80+        | 2966268
All        | 134172318
Mean: 33414201.375
Median: 17385178.0
Mode: 2966268.0
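The `statistics` functions imported earlier can reproduce these hand-computed values. The sketch below is one way to do that, using the Deaths totals printed by the previous cell. It also calls `multimode` (available from Python 3.8) to show that every total occurs exactly once, so there is no single most common value and the mode reported above is simply the smallest total.

```python
from statistics import mean, median, multimode

# Deaths totals per age group, copied from the printed age_group_data.
deaths = [
    2966268,    # 80+
    134172318,  # All
    23580296,   # 50-64
    8199577,    # 5-11
    31093965,   # 18-29
    11190060,   # 12-17
    45103339,   # 30-49
    11007788,   # 65-79
]

mean_value = mean(deaths)
median_value = median(deaths)

# multimode returns every value that shares the highest frequency.
modes = multimode(deaths)
all_unique = len(modes) == len(deaths)
smallest_mode = min(modes)

print("Mean:", mean_value)
print("Median:", median_value)
print("All values unique:", all_unique)
print("Smallest mode:", smallest_mode)
```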


## **3. Data Visualization**
### **1. Read in the data.**
1-1) Pick the key-value pairs to visualize and sort them by value.

In [None]:
import pandas as pd
import numpy as np
age_group_outcomes = {}
for age_group, totals in age_group_data.items():
    deaths = totals['Deaths']

    age_group_outcomes[age_group] = {
        'Deaths': deaths
    }

In [None]:
filtered_age_group = {age_group: outcomes['Deaths'] for age_group, outcomes in age_group_outcomes.items()}
filtered_age_group

{'80+': 2966268,
 'All': 134172318,
 '50-64': 23580296,
 '5-11': 8199577,
 '18-29': 31093965,
 '12-17': 11190060,
 '30-49': 45103339,
 '65-79': 11007788}

In [None]:
sorted_data = dict(sorted(filtered_age_group.items(), key=lambda item: item[1], reverse=True))
print(sorted_data)

{'All': 134172318, '30-49': 45103339, '18-29': 31093965, '50-64': 23580296, '12-17': 11190060, '65-79': 11007788, '5-11': 8199577, '80+': 2966268}


### **2. Create a text bar chart ("sparkle" line) to visualize the data.**

In [None]:
max_value = max(sorted_data.values())
max_sparkle_length = 50 

for age_group, value in sorted_data.items():
    sparkle_length = int(value / max_value * max_sparkle_length) 
    sparkle = '*' * sparkle_length
    print(f"{age_group:<10} | {sparkle} {value}")

All        | ************************************************** 134172318
30-49      | **************** 45103339
18-29      | *********** 31093965
50-64      | ******** 23580296
12-17      | **** 11190060
65-79      | **** 11007788
5-11       | *** 8199577
80+        | * 2966268
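Because the aggregate 'All' row sets the scale, the bars for the individual age groups are compressed. A possible variation, sketched here with the totals printed above, leaves out 'All' so that the largest age group fills the full width. The names `age_groups_only` and `bars` are illustrative and not part of the notebook.

```python
# Totals copied from sorted_data above.
sorted_data = {
    "All": 134172318,
    "30-49": 45103339,
    "18-29": 31093965,
    "50-64": 23580296,
    "12-17": 11190060,
    "65-79": 11007788,
    "5-11": 8199577,
    "80+": 2966268,
}

# Drop the aggregate row so the scale is set by the largest age group.
age_groups_only = {k: v for k, v in sorted_data.items() if k != "All"}

max_value = max(age_groups_only.values())
max_sparkle_length = 50

bars = {}
for age_group, value in age_groups_only.items():
    # Same truncating scale as the original cell.
    sparkle_length = int(value / max_value * max_sparkle_length)
    bars[age_group] = "*" * sparkle_length
    print(f"{age_group:<10} | {bars[age_group]} {value}")
```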


## **4. Conclusion**

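• The mean, median, and mode of the summed 'Population Vaccinated' values in Deaths rows across age groups were as follows: Mean = 33,414,201.375, Median = 17,385,178.0, and Mode = 2,966,268.0. These totals add up weekly vaccinated-population figures and are not death counts. Every total occurs only once, so the reported mode is simply the smallest value.

• Excluding the aggregate 'All' row (134,172,318), the highest total was in the 30-49 age group (45,103,339), followed by the 18-29 age group (31,093,965).

• The lowest total was in the 80+ age group (2,966,268).

• For further study, calculating the ratio of deaths to cases is necessary, using the outcome count columns (for example 'Outcome Vaccinated') rather than population totals. This analysis will help identify which age group is more susceptible to COVID-19-related deaths.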
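As a starting point for that follow-up, the ratio of deaths to cases could be computed per age group along these lines. This is only a sketch: the counts below are hypothetical placeholders, not values from the dataset, and the function name `deaths_per_100_cases` is invented for illustration. Real counts would presumably come from summing the 'Outcome Vaccinated' column for Deaths rows and Cases rows.

```python
# Hypothetical placeholder counts, NOT taken from the dataset.
vaccinated_outcomes = {
    "18-29": {"Cases": 1000, "Deaths": 1},
    "80+": {"Cases": 400, "Deaths": 12},
}


def deaths_per_100_cases(counts):
    """Return deaths per 100 cases, or None when there are no cases."""
    cases = counts["Cases"]
    if cases == 0:
        return None  # avoid dividing by zero
    return counts["Deaths"] / cases * 100


ratios = {
    age_group: deaths_per_100_cases(counts)
    for age_group, counts in vaccinated_outcomes.items()
}

for age_group, ratio in ratios.items():
    print(f"{age_group:<10} | {ratio:.2f} deaths per 100 cases")
```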