# <span style="color:Blue">Assignment-2 of COSC5806: Data Analysis with Python</span>

## <span style="color:Purple">You may use core Python's built-in modules/packages/libraries and NumPy. No other libraries are allowed, including pandas, scikit-learn, matplotlib, and Seaborn. Please read the instructions carefully and do not hesitate to contact me if you have any questions.</span>

### <span style="color:Red">Examples and Resources for this assignment:</span>
<ul>
    <li><span style="color:Red">Chapters 3, 4, 5, 6, 7, 8, and 9 from <a href="https://docs.python.org/3/tutorial/index.html">The Python Tutorial</a></span></li>
    <li><span style="color:Red">Chapter 2 from <a href="https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html">Introduction to NumPy</a></span></li>
</ul>

### <span style="color:Green">Context</span>
This dataset compiles daily snapshots of publicly reported data on 2019 Novel Coronavirus (COVID-19) testing in Ontario. Data includes:

<ul>
    <li><span>date</span></li>
    <li><span>OH region</span></li>
    <li><span>current hospitalizations with COVID-19</span></li>
    <li><span>current patients in Intensive Care Units (ICUs) due to COVID-related critical illness</span></li>
    <li><span>current patients in Intensive Care Units (ICUs) testing positive for COVID</span></li>
    <li><span>current patients in Intensive Care Units (ICUs) no longer testing positive for COVID</span></li>
    <li><span>current patients in Intensive Care Units (ICUs) on ventilators due to COVID-related critical illness</span></li>
    <li><span>current patients in Intensive Care Units (ICUs) on ventilators testing positive for COVID</span></li>
    <li><span>current patients in Intensive Care Units (ICUs) on ventilators no longer testing positive for COVID</span></li>
</ul>

The following <a href="https://data.ontario.ca/dataset/covid-19-cases-in-hospital-and-icu-by-ontario-health-region">link</a> might be useful for the description of the features.

# <span style="color:Green">P1: Load the dataset.</span>

In [331]:
#read the data from the csv file into a list of dict rows (one dict per record)
import csv
import numpy as np
from datetime import datetime
#newline='' is what the csv module documentation recommends for files passed to a csv reader
with open(r'D:\Algoma\COSC5806001_DataAnalysis\e760480e-1f95-4634-a923-98161cfb02fa.csv', 'r', encoding="utf-8", newline='') as f:
    reader = list(csv.DictReader(f))

#replace an empty or missing value with 0
def replace_empty_with_zero(value):
    if value == "" or value is None:
        return 0
    return value

#same as above, but also replace the '.' placeholder found in the icu_current_covid column with 0
def replaceDot(value):
    if value == "" or value is None or value == '.':
        return 0
    return value

# <span style="color:Green">P2: How many unique values exist in the 'oh_region' column? What are the unique values in the 'oh_region' column? How many records exist per 'oh_region'?</span>

In [333]:
#Codes of P2 here
oh_region_records = {}
#get oh-region data 
region_data = [row['oh_region'].strip() for row in reader]
oh_region_array = np.array(region_data)
unique_array = np.unique(oh_region_array)
for region in region_data:
    if region in oh_region_records:
        oh_region_records[region] +=1
    else:
        oh_region_records[region]=1

print('There are {} unique values in the \'oh_region\' column'.format(unique_array.size))
print('Unique values are: ', unique_array)
print('Total records in \'oh_region\' are {}'.format(oh_region_records))

There are 6 unique values in the 'oh_region' column
Unique values are:  ['CENTRAL' 'EAST' 'NORTH EAST' 'NORTH WEST' 'TORONTO' 'WEST']
Total records in 'oh_region' are {'CENTRAL': 1700, 'EAST': 1700, 'NORTH EAST': 1700, 'NORTH WEST': 1700, 'TORONTO': 1700, 'WEST': 1700}
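
The counting loop above can also be expressed with NumPy alone. The following is a minimal sketch, assuming a small made-up `regions` array in place of the real `oh_region` column; the values are illustrative only.

```python
import numpy as np

# Toy region labels standing in for the 'oh_region' column (illustrative only).
regions = np.array(["EAST", "WEST", "EAST", "TORONTO", "WEST", "EAST"])

# return_counts=True returns the sorted unique values and how often each occurs.
unique_regions, counts = np.unique(regions, return_counts=True)

# Convert NumPy scalars to plain Python types for a clean dictionary.
records_per_region = {str(r): int(c) for r, c in zip(unique_regions, counts)}

print(unique_regions.size)
print(records_per_region)
```

With the real data, `regions` would be the `oh_region_array` built in the cell above.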


# <span style="color:Green">P3: What is the total number of hospitalizations? What is the average number of hospitalizations per day?</span>

In [335]:
#Codes of P3 here
#read the hospitalizations column (one value per region per day); empty cells count as 0
hospitalization_data = [float(replace_empty_with_zero(row['hospitalizations'].strip())) for row in reader]
total_hospitalization = np.sum(hospitalization_data)
#mean over all records, i.e. over every region-day row
mean_hospitalization = np.mean(hospitalization_data)

print('Total number of hospitalization :',total_hospitalization)
print('Avg number of hospitalization :', mean_hospitalization)

Total number of hospitalization : 1396853.0
Avg number of hospitalization : 136.9463725490196
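
Note that `np.mean` over all rows averages over region-day records, because each date appears once per region. If "per day" is read as a province-wide daily figure, one possible approach is to divide the total by the number of distinct dates. The sketch below assumes made-up `dates` and `hospitalizations` arrays and is only one interpretation of the question.

```python
import numpy as np

# Toy rows: one record per (date, region); values are illustrative only.
dates = np.array(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"])
hospitalizations = np.array([10.0, 20.0, 30.0, 40.0])

# Average over region-day records (what np.mean over all rows gives).
per_record_mean = hospitalizations.mean()

# Average per calendar day: total divided by the number of distinct dates.
n_days = np.unique(dates).size
per_day_mean = hospitalizations.sum() / n_days

print(per_record_mean, per_day_mean)
```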


# <span style="color:Green">P4: What are the top 5 days with the highest number of hospitalizations?</span>

In [337]:
#Codes of P4 here

#read the hospitalizations column (empty cells count as 0)
hospitalization_data = [float(replace_empty_with_zero(row['hospitalizations'].strip())) for row in reader]
array_hospitalization_data = np.array(hospitalization_data)
#indices that sort the values in ascending order (named sort_idx so the built-in sorted() is not shadowed)
sort_idx = np.argsort(array_hospitalization_data)

dates = [datetime.strptime(row['date'].strip(),"%Y-%m-%dT%H:%M:%S") for row in reader]
array_date = np.array(dates)

# The order is ascending, so the last 5 elements are the 5 largest values
highest_hospitalization = array_hospitalization_data[sort_idx][-5:]
highest_date = array_date[sort_idx][-5:]

# Print the top 5 days
print('Top 5 days with the highest number of hospitalizations')
for day,hospitalization in zip(highest_date,highest_hospitalization):
    print('On {}, the number of hospitalization is: {}'.format(day.strftime("%Y-%m-%d"), hospitalization))


Top 5 days with the highest number of hospitalizations
On 2022-01-23, the number of hospitalization is: 1205.0
On 2022-01-18, the number of hospitalization is: 1211.0
On 2022-01-16, the number of hospitalization is: 1221.0
On 2022-01-17, the number of hospitalization is: 1231.0
On 2022-01-19, the number of hospitalization is: 1239.0
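
The ranking above is taken over individual region-day records. If the intent is to rank days by the total across all regions, the values can first be summed per date. This is a minimal sketch under that assumption, using made-up arrays and `np.unique` with `return_inverse` plus `np.bincount` for the grouping.

```python
import numpy as np

# Toy rows: two regions reporting on each of three days (illustrative only).
dates = np.array(["2022-01-01", "2022-01-01", "2022-01-02",
                  "2022-01-02", "2022-01-03", "2022-01-03"], dtype="datetime64[D]")
hospitalizations = np.array([5.0, 7.0, 20.0, 1.0, 9.0, 9.0])

# inverse[i] is the position of dates[i] within unique_days.
unique_days, inverse = np.unique(dates, return_inverse=True)

# Sum the values that share the same day.
daily_totals = np.bincount(inverse, weights=hospitalizations)

# Indices of the 2 largest daily totals, largest first.
top = np.argsort(daily_totals)[::-1][:2]
for day, total in zip(unique_days[top], daily_totals[top]):
    print(day, total)
```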


# <span style="color:Green">P5: What are the top 5 days with the highest number of ICU COVID ('icu_current_covid') cases?</span>

In [339]:
#Codes of P5 here
#collect [date, icu_current_covid] pairs, skipping records where the value is the '.' placeholder
list_icu_date =[]
for row in reader:
    record_date = datetime.strptime(row['date'].strip(),"%Y-%m-%dT%H:%M:%S")
    icu_current_data = row['icu_current_covid'].strip()
    if icu_current_data != '.' :
        list_icu_date.append([record_date,icu_current_data])

#convert list into arrays
date_array = np.array([sublist[0] for sublist in list_icu_date])
icu_array = np.array([sublist[1] for sublist in list_icu_date])

#convert the values from str to float for sorting and get the indices in ascending order
#(named sort_idx so the built-in sorted() is not shadowed)
sort_idx = np.argsort(icu_array.astype(float))
# The order is ascending, so the last 5 elements are the 5 largest values
highest_icu_current = icu_array[sort_idx][-5:]
highest_date = date_array[sort_idx][-5:]
# Print the top 5 days
print('Top 5 days with the highest number of icu current covid')
for day,icu_current in zip(highest_date,highest_icu_current):
    print('On {} , the number of icu current is: {}'.format(day.strftime("%Y-%m-%d"), icu_current))

Top 5 days with the highest number of icu current covid
On 2021-05-01 , the number of icu current is: 268
On 2021-05-03 , the number of icu current is: 268
On 2021-05-04 , the number of icu current is: 272
On 2021-05-07 , the number of icu current is: 277
On 2021-05-08 , the number of icu current is: 278


# <span style="color:Green">P6: How many patients were in ICU but not on ventilators?</span>

In [341]:
#Codes of P6 here
#read the two ICU columns; '' and '.' are treated as 0
icu_current_data = [replaceDot(row['icu_current_covid'].strip()) for row in reader]
icu_current_covid_vented = [replaceDot(row['icu_current_covid_vented'].strip()) for row in reader] 

cleaned_icu_array = np.array(icu_current_data).astype(float) #convert items of cleaned array from str to float 
cleaned_icu_vented_array =np.array(icu_current_covid_vented).astype(float) #convert items of cleaned array from str to float 
#ICU patients not on ventilators, per record
icu_not_vented = cleaned_icu_array - cleaned_icu_vented_array

print('Avg number of patients who were in ICU but not on ventilators is: ',np.mean(icu_not_vented))
print('Total number of patients who were in ICU but not on ventilators is: ',np.sum(icu_not_vented))

Avg number of patients who were in ICU but not on ventilators is:  9.090882352941177
Total number of patients who were in ICU but not on ventilators is:  92727.0
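
Replacing `'.'` with 0 counts a missing report as zero patients, which pulls the average down. One alternative is to mark missing values as NaN and use NumPy's NaN-aware reductions. The sketch below assumes made-up input strings, and `to_float_or_nan` is a hypothetical helper, not part of the assignment code.

```python
import numpy as np

def to_float_or_nan(value):
    """Hypothetical helper: map None, '' or '.' to NaN, otherwise parse as float."""
    if value is None or value.strip() in ("", "."):
        return np.nan
    return float(value)

# Toy column values (illustrative only); '.' and '' stand for missing reports.
icu = np.array([to_float_or_nan(v) for v in ["10", ".", "8", ""]])
vented = np.array([to_float_or_nan(v) for v in ["4", ".", "3", "1"]])

# The difference is NaN wherever either input is missing.
not_vented = icu - vented

# nansum / nanmean ignore the NaN entries instead of counting them as 0.
total_not_vented = np.nansum(not_vented)
mean_not_vented = np.nanmean(not_vented)

print(total_not_vented, mean_not_vented)
```

Whether missing values should be ignored or counted as 0 depends on how the dataset defines the placeholder, so this is a design choice rather than a correction.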


# <span style="color:Green">P7: Are there any seasonal (Winter, Spring, Summer, Fall) patterns in COVID-19 hospitalizations? </span>

In [343]:
#Codes of P7 here
seasons = {12: 'Winter', 1: 'Winter', 2: 'Winter',
           3: 'Spring', 4: 'Spring', 5: 'Spring',
           6: 'Summer', 7: 'Summer', 8: 'Summer',
           9: 'Fall', 10: 'Fall', 11: 'Fall'}

#get the season and the hospitalization count for every record
seasonal_data = []
for row in reader:
    date_obj = np.datetime64(row['date'].strip())
    hospitalization_count = float(replace_empty_with_zero(row['hospitalizations'].strip()))
    # Extract month (1-12): months elapsed since 1970-01, modulo 12, plus 1
    month = int(date_obj.astype('datetime64[M]').astype(int) % 12 + 1)
    seasonal_data.append([seasons[month], hospitalization_count])
#convert list into arrays (kept separate so the counts stay numeric)
seasonData_array = np.array([sublist[0] for sublist in seasonal_data])
hospitalization_array = np.array([sublist[1] for sublist in seasonal_data])
#get 4 unique season names 
unique_seasons = np.unique(np.array(list(seasons.values())))

#get total number of hospitalizations per season
pattern = {}
for season in unique_seasons:
    season_mask = seasonData_array == season  # True for records that fall in this season
    pattern[season] = np.sum(hospitalization_array[season_mask])
    
    
print('Seasonal (Winter, Spring, Summer, Fall) patterns in COVID-19 hospitalizations are ')
for season,hospitalization in pattern.items():
    print('On {} : {}'.format(season, hospitalization))

Seasonal (Winter, Spring, Summer, Fall) patterns in COVID-19 hospitalizations are 
On Fall : 347340.0
On Spring : 376684.0
On Summer : 204219.0
On Winter : 468610.0
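
The month-to-season mapping can also be done without a Python loop. The following is a minimal sketch, assuming made-up `dates` and `values` arrays; `season_of_month` is a hypothetical lookup table indexed by month number. It also reports a per-season mean, since season totals depend on how many days of each season the data covers.

```python
import numpy as np

# Toy records (illustrative only).
dates = np.array(["2021-01-15", "2021-04-02", "2021-07-20",
                  "2021-10-05", "2021-12-31"], dtype="datetime64[D]")
values = np.array([100.0, 50.0, 20.0, 40.0, 60.0])

# Month number 1-12 for every date: months since 1970-01, modulo 12, plus 1.
months = dates.astype("datetime64[M]").astype(int) % 12 + 1

# Lookup table indexed by month number (index 0 is unused).
season_of_month = np.array(["", "Winter", "Winter", "Spring", "Spring", "Spring",
                            "Summer", "Summer", "Summer", "Fall", "Fall", "Fall",
                            "Winter"])
season_labels = season_of_month[months]

# Total and mean per season, selected with boolean masks.
totals = {str(s): float(values[season_labels == s].sum()) for s in np.unique(season_labels)}
means = {str(s): float(values[season_labels == s].mean()) for s in np.unique(season_labels)}

print(totals)
print(means)
```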


# <span style="color:Green">P8: What is the average number of ICU patients on ventilators ('icu_current_covid_vented') per region? </span>

In [345]:
#Codes of P8 here
#read the icu_current_covid_vented column; '' and '.' are treated as 0
icu_current_covid_vented = [replaceDot(row['icu_current_covid_vented'].strip()) for row in reader]
cleaned_icu_vented_array = np.array(icu_current_covid_vented).astype(int) #convert items from str to int 

#get oh-region data 
region_data = [row['oh_region'].strip() for row in reader]
oh_region_array = np.array(region_data)
unique_region_array = np.unique(oh_region_array)

#average the vented counts over the records of each region, selected with a boolean mask
result ={}
for region in unique_region_array:
    region_mask = oh_region_array == region
    result[region] = np.mean(cleaned_icu_vented_array[region_mask])
print('Average number of ICU patients on ventilators per region')
for region,avg in result.items():
    print(' {} : {}'.format(region, avg))

Average number of ICU patients on ventilators per region
 CENTRAL : 17.311764705882354
 EAST : 10.749411764705883
 NORTH EAST : 1.6705882352941177
 NORTH WEST : 0.841764705882353
 TORONTO : 24.89235294117647
 WEST : 21.956470588235295


# <span style="color:Green">P9: What are the busiest regions based on ICU occupancy?</span>

In [347]:
#Codes of P9 here

#read icu_current_covid col; '' and '.' are treated as 0
icu_current_covid = [replaceDot(row['icu_current_covid'].strip()) for row in reader]  
cleaned_icu_current_covid_array = np.array(icu_current_covid).astype(int) #convert items from str to int 

#get oh-region data 
region_data = [row['oh_region'].strip() for row in reader]
oh_region_array = np.array(region_data)
unique_region_array = np.unique(oh_region_array)

#create result dictionary with region as key and total icu count as value
result_icu_per_region = {}
for region in unique_region_array:
    region_mask = oh_region_array == region  # True for records of this region
    total = np.sum(cleaned_icu_current_covid_array[region_mask])
    result_icu_per_region[str(region)] = int(total)

# Find the max value of total icu
max_icu_value = max(result_icu_per_region.values())

# Find all regions with the max value
highest_regions = [ region for region, totalicuNo in result_icu_per_region.items() if totalicuNo == max_icu_value] 

print('The regions with ICU occupancy are: ' , result_icu_per_region)
print(f'The busiest regions based on ICU occupancy is/are: {highest_regions} with {max_icu_value}' )


The regions with ICU occupancy are:  {'CENTRAL': 48648, 'EAST': 35893, 'NORTH EAST': 7834, 'NORTH WEST': 3021, 'TORONTO': 65005, 'WEST': 63944}
The busiest regions based on ICU occupancy is/are: ['TORONTO'] with 65005
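
Since the question asks for the busiest regions in the plural, a full ranking may be more informative than the single maximum. The sketch below assumes a small made-up `totals` dictionary in place of `result_icu_per_region` and orders it from highest to lowest with the built-in `sorted`.

```python
# Toy totals per region (illustrative only, not the real results).
totals = {"A": 30, "B": 75, "C": 75, "D": 10}

# Sort (region, total) pairs by total, largest first; ties keep dictionary order
# because Python's sort is stable.
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for rank, (region, total) in enumerate(ranking, start=1):
    print(rank, region, total)
```

This relies on `sorted` still referring to the Python built-in, so it should not be reused as a variable name earlier in the notebook.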


# <span style="color:Green">P10: Which months had the highest number of hospitalizations? </span>

In [349]:
#Codes of P10 here

#build a 'YYYY-MM' key and the hospitalization count for every record
columns_data = []
for row in reader:
    date_obj = np.datetime64(row['date'].strip())
    hospitalization_count = float(replace_empty_with_zero(row['hospitalizations'].strip()))
    # Extract year and month: datetime64 counts years and months from 1970-01
    year = date_obj.astype('datetime64[Y]').astype(int) + 1970
    month = date_obj.astype('datetime64[M]').astype(int) % 12 + 1
    columns_data.append([f"{year}-{month:02d}", hospitalization_count])

# convert each into array from each sublist (kept separate so the counts stay numeric)
date_array = np.array([sublist[0] for sublist in columns_data])
hospitalization_array = np.array([sublist[1] for sublist in columns_data])
#extract months and remove duplicates
unique_months = np.unique(date_array)

#get total number of hospitalizations per month
hospitalization_perMonth = {}
for month in unique_months:
    month_mask = date_array == month  # True for records that fall in this month
    hospitalization_perMonth[str(month)] = float(np.sum(hospitalization_array[month_mask]))

# Find the max value of total hospitalization
max_value = max(hospitalization_perMonth.values())

# Find all months with the max value
max_months = [month for month, totalNo in hospitalization_perMonth.items() if totalNo == max_value] 
#print('The number of hospitalizations per month: ')
#print(hospitalization_perMonth)
print(f'The Highest number of hospitalizations is in the month of: {max_months} with {max_value}' )

The Highest number of hospitalizations is in the month of: ['2022-01'] with 104620.0
