<a href="https://colab.research.google.com/github/FrenchFreis/CCDATSCL_EXERCISES_COM222/blob/main/Exercise4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 4

This exercise focuses on data visualization and interpretation using a real-world COVID-19 dataset. The dataset contains daily records of confirmed cases, deaths, recoveries, and active cases across countries and regions, along with temporal and geographic information.
The goal of this exercise is not only to create charts, but to choose appropriate visualizations, apply correct data aggregation, and draw meaningful insights from the data. You will work with time-based, categorical, numerical, and geographic variables, and you are expected to think critically about how design choices affect interpretation.

Your visualizations should follow good practices:
- Use clear titles, axis labels, and legends
- Choose chart types appropriate to the data and question
- Avoid misleading scales or cluttered designs
- Clearly explain patterns, trends, or anomalies you observe

Unless stated otherwise, you may filter, aggregate, or group the data as needed.

<img src="https://d3i6fh83elv35t.cloudfront.net/static/2020/03/Screen-Shot-2020-03-05-at-6.29.29-PM-1024x574.png"/>

In [144]:
import kagglehub
import os
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("imdevskp/corona-virus-report")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'corona-virus-report' dataset.
Path to dataset files: /kaggle/input/corona-virus-report


In [145]:
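# Confirm the download path is a directory before listing it
if os.path.isdir(path):
  print(True)

contents = os.listdir(path)

# os.listdir() does not guarantee any ordering, so contents[0] may not be
# the same file on every machine; check which file is picked before relying on it
mydataset = os.path.join(path, contents[0])

df = pd.read_csv(mydataset)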

True


In [146]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49068 entries, 0 to 49067
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Province/State  14664 non-null  object 
 1   Country/Region  49068 non-null  object 
 2   Lat             49068 non-null  float64
 3   Long            49068 non-null  float64
 4   Date            49068 non-null  object 
 5   Confirmed       49068 non-null  int64  
 6   Deaths          49068 non-null  int64  
 7   Recovered       49068 non-null  int64  
 8   Active          49068 non-null  int64  
 9   WHO Region      49068 non-null  object 
dtypes: float64(2), int64(4), object(4)
memory usage: 3.7+ MB
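
The summary above lists `Date` as an `object` column and shows that `Province/State` is null for most rows, which suggests some countries are split across several province-level rows. A minimal sketch of one way to prepare such data, run here on a toy frame with made-up values: the helper name `to_country_level` is hypothetical and not part of the exercise.

```python
import pandas as pd

# Toy rows mimicking the dataset's columns; the values are made up for illustration
toy = pd.DataFrame({
    "Province/State": [None, "Hubei", "Beijing", None],
    "Country/Region": ["Albania", "China", "China", "Albania"],
    "Date": ["2020-01-22", "2020-01-22", "2020-01-22", "2020-01-23"],
    "Confirmed": [0, 444, 14, 0],
})

def to_country_level(frame):
    """Parse Date and sum province-level rows into one row per country and date."""
    out = frame.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    return out.groupby(["Date", "Country/Region"], as_index=False)["Confirmed"].sum()

country_level = to_country_level(toy)
print(country_level)
```

Parsing the dates is optional for ISO-formatted strings such as these, since they already sort chronologically, but a real datetime column makes resampling and date arithmetic easier later on.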


In [147]:
df.head()

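| | Province/State | Country/Region | Lat | Long | Date | Confirmed | Deaths | Recovered | Active | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | Afghanistan | 33.93911 | 67.709953 | 2020-01-22 | 0 | 0 | 0 | 0 | Eastern Mediterranean |
| 1 | NaN | Albania | 41.1533 | 20.1683 | 2020-01-22 | 0 | 0 | 0 | 0 | Europe |
| 2 | NaN | Algeria | 28.0339 | 1.6596 | 2020-01-22 | 0 | 0 | 0 | 0 | Africa |
| 3 | NaN | Andorra | 42.5063 | 1.5218 | 2020-01-22 | 0 | 0 | 0 | 0 | Europe |
| 4 | NaN | Angola | -11.2027 | 17.8739 | 2020-01-22 | 0 | 0 | 0 | 0 | Africa |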


In [148]:
df.query("`Country/Region` == 'Philippines'")

| | Province/State | Country/Region | Lat | Long | Date | Confirmed | Deaths | Recovered | Active | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|
| 180 | NaN | Philippines | 12.879721 | 121.774017 | 2020-01-22 | 0 | 0 | 0 | 0 | Western Pacific |
| 441 | NaN | Philippines | 12.879721 | 121.774017 | 2020-01-23 | 0 | 0 | 0 | 0 | Western Pacific |
| 702 | NaN | Philippines | 12.879721 | 121.774017 | 2020-01-24 | 0 | 0 | 0 | 0 | Western Pacific |
| 963 | NaN | Philippines | 12.879721 | 121.774017 | 2020-01-25 | 0 | 0 | 0 | 0 | Western Pacific |
| 1224 | NaN | Philippines | 12.879721 | 121.774017 | 2020-01-26 | 0 | 0 | 0 | 0 | Western Pacific |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 47943 | NaN | Philippines | 12.879721 | 121.774017 | 2020-07-23 | 74390 | 1871 | 24383 | 48136 | Western Pacific |
| 48204 | NaN | Philippines | 12.879721 | 121.774017 | 2020-07-24 | 76444 | 1879 | 24502 | 50063 | Western Pacific |
| 48465 | NaN | Philippines | 12.879721 | 121.774017 | 2020-07-25 | 78412 | 1897 | 25752 | 50763 | Western Pacific |
| 48726 | NaN | Philippines | 12.879721 | 121.774017 | 2020-07-26 | 80448 | 1932 | 26110 | 52406 | Western Pacific |


## A. Time-Based Visualizations

1. Global Trend `(5 pts)`

Aggregate the data by Date and create a line chart showing the global number of confirmed COVID-19 cases over time.

In [149]:

import plotly.express as px

global_cases = df.groupby('Date')['Confirmed'].sum().reset_index()

fig = px.line(global_cases, x="Date", y="Confirmed", title="Global Trend of Confirmed COVID-19 Cases Over Time", labels={'Confirmed':'Total Cases'})

fig.show()
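
To make the aggregation step concrete, here is a small sketch on toy data (made-up values, hypothetical countries "A" and "B"). It assumes, as the Philippines rows above suggest, that `Confirmed` is a running total, so the day-over-day difference of the global sum approximates newly reported cases. The `New` column is an illustrative addition, not something the exercise requires.

```python
import pandas as pd

# Toy data: two countries reported over three days (values are made up)
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "A", "B", "A", "B"],
    "Date": ["2020-03-01", "2020-03-01", "2020-03-02",
             "2020-03-02", "2020-03-03", "2020-03-03"],
    "Confirmed": [10, 5, 15, 5, 30, 9],
})

# One global total per date: sum every country's count for that day
global_totals = toy.groupby("Date")["Confirmed"].sum().reset_index()

# Day-over-day change in the running total; the first day has no previous value
global_totals["New"] = global_totals["Confirmed"].diff().fillna(0).astype(int)

print(global_totals)
```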


2. Country-Level Trends `(5 pts)`

Select three countries and visualize their confirmed case counts over time on the same plot.

In [150]:
selected_countries = ['US', 'India', 'Brazil']

filtered_df = df[df['Country/Region'].isin(selected_countries)]

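# Sum any province-level rows so each country has a single value per date
country_trends = filtered_df.groupby(['Date','Country/Region'])['Confirmed'].sum().reset_index()

fig = px.line(country_trends, x="Date", y="Confirmed", color="Country/Region", title="Confirmed COVID-19 Cases Over Time: US, India, and Brazil", labels={'Confirmed':'Total Cases', 'Country/Region':'Country'})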
fig.show()
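
When comparing countries it can help to inspect the numbers side by side before plotting. The sketch below, on toy data with made-up values, reshapes the long table into one column per country with `pivot_table`; `aggfunc="sum"` combines any province-level rows that share a country and date. Country "A" is given two hypothetical provinces to show the effect.

```python
import pandas as pd

# Toy data: country A is split into two provinces, country B is not
toy = pd.DataFrame({
    "Province/State": ["p1", "p2", None, "p1", "p2", None],
    "Country/Region": ["A", "A", "B", "A", "A", "B"],
    "Date": ["2020-03-01"] * 3 + ["2020-03-02"] * 3,
    "Confirmed": [3, 4, 10, 5, 6, 12],
})

# Rows become dates, columns become countries, cells hold summed confirmed cases
wide = toy.pivot_table(
    index="Date",
    columns="Country/Region",
    values="Confirmed",
    aggfunc="sum",
)

print(wide)
```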


3. Active vs Recovered `(5 pts)`

For a selected country, create a line chart showing Active and Recovered cases over time.

In [151]:
selected_country_df = df.query("`Country/Region` == 'South Korea'")

df_select = selected_country_df.groupby('Date')[['Active', 'Recovered']].sum().reset_index()

fig = px.line(df_select, x="Date", y=["Active", "Recovered"], title="Active vs Recovered COVID-19 Cases in South Korea", labels={'value': 'Number of Cases', 'variable': 'Status'})
fig.show()
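
Passing a list to `y` asks Plotly Express to treat the frame as wide-form data. It then names the series column `variable` and the measurement column `value`, which is why the `labels` dictionary above uses those two keys. The sketch below shows the equivalent long-form reshape with pandas `melt`, whose default column names happen to match. The numbers are made up.

```python
import pandas as pd

# Toy wide-form frame: one column per status (values are made up)
wide = pd.DataFrame({
    "Date": ["2020-03-01", "2020-03-02"],
    "Active": [5, 8],
    "Recovered": [1, 3],
})

# Long form: one row per (date, status) pair
long_form = wide.melt(id_vars="Date", value_vars=["Active", "Recovered"])

print(long_form)
```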


## B. Comparative Visualizations

4. Country Comparison `(5 pts)`

Using data from a single date, create a bar chart showing the top 10 countries by confirmed cases.

In [152]:
country_confirm_df = df.query("`Date` == '2020-06-21'")

country_select_df = country_confirm_df.groupby('Country/Region')['Confirmed'].sum().nlargest(10).reset_index()

fig = px.bar(country_select_df, x="Country/Region", y="Confirmed", color='Country/Region', title="Top 10 Countries by Confirmed COVID-19 Cases on 2020-06-21", labels={'Confirmed':'Total Cases', 'Country/Region':'Country'})

fig.show()
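
The ranking logic can be checked on a small example. This sketch uses toy data (hypothetical countries "A" to "D", made-up values) and keeps the top 2 rather than the top 10: it filters to one date, sums per country, and lets `nlargest` return the totals sorted from highest to lowest.

```python
import pandas as pd

# Toy data: four countries on two dates (values are made up)
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D"] * 2,
    "Date": ["2020-06-20"] * 4 + ["2020-06-21"] * 4,
    "Confirmed": [100, 50, 70, 5, 120, 90, 80, 6],
})

# Keep a single date so cumulative counts from other days are not mixed in
snapshot = toy[toy["Date"] == "2020-06-21"]

# Sum per country, then keep the two largest totals in descending order
top = snapshot.groupby("Country/Region")["Confirmed"].sum().nlargest(2).reset_index()

print(top)
```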




5. WHO Region Comparison `(5 pts)`

Aggregate confirmed cases by WHO Region and visualize the result using a bar chart.

In [153]:
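# Confirmed appears to be cumulative, so summing over every date would count the same cases repeatedly; use the latest date as a snapshot
latest_date = df['Date'].max()
WHO_cases = df[df['Date'] == latest_date].groupby('WHO Region')['Confirmed'].sum().reset_index()

fig = px.bar(WHO_cases, x="WHO Region", y="Confirmed", color="WHO Region", title=f"Confirmed COVID-19 Cases by WHO Region as of {latest_date}", labels={'Confirmed':'Total Cases'})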

fig.show()
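
One thing to watch when aggregating by region: if `Confirmed` is a running total (an assumption based on the steadily rising Philippines counts shown earlier), summing it across all dates counts the same cases once per day. The toy example below, with made-up values and hypothetical regions "R1" and "R2", contrasts that all-dates sum with a single-date snapshot.

```python
import pandas as pd

# Toy cumulative counts for two regions over two days (values are made up)
toy = pd.DataFrame({
    "WHO Region": ["R1", "R2", "R1", "R2"],
    "Date": ["2020-07-26", "2020-07-26", "2020-07-27", "2020-07-27"],
    "Confirmed": [10, 1, 20, 2],
})

# Summing every date adds each day's running total together
all_dates = toy.groupby("WHO Region")["Confirmed"].sum()

# Restricting to the latest date gives the cumulative total as of that day
latest = toy[toy["Date"] == toy["Date"].max()]
snapshot = latest.groupby("WHO Region")["Confirmed"].sum()

print(all_dates)
print(snapshot)
```

The ranking of regions may look similar either way, but only the snapshot values can be read as case counts.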

## C. Geographic Visualization

6. Geographic Spread `(10 pts)`

Using Latitude and Longitude, create a map-based visualization showing confirmed cases for a selected date.

In [154]:
country_confirm_df = df.query("`Date` == '2020-07-21'")

country_select_df = country_confirm_df.groupby('Country/Region')['Confirmed'].sum().reset_index()

fig = px.choropleth(
    country_select_df,
    locations="Country/Region",
    locationmode="country names",
    color="Confirmed",
    hover_name="Country/Region",
    projection="natural earth",
    title="Confirmed COVID-19 Cases by Country on 2020-07-21",
)
fig.show()
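
The prompt asks for a map built from Latitude and Longitude, whereas a choropleth matches on country names. One possible alternative is a bubble map with `px.scatter_geo`, sketched below on toy data with made-up coordinates and counts. The helper names `points_for_date` and `plot_points` are hypothetical. Plotly is imported inside the plotting helper so that the data preparation can be run on its own.

```python
import pandas as pd

# Toy rows with coordinates (locations and values are made up)
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "A"],
    "Lat": [10.0, -20.0, 45.0, 10.0],
    "Long": [100.0, 30.0, -75.0, 100.0],
    "Date": ["2020-07-21", "2020-07-21", "2020-07-21", "2020-07-20"],
    "Confirmed": [500, 0, 120, 450],
})

def points_for_date(frame, date):
    """Return the rows for one date that have at least one confirmed case."""
    snapshot = frame[frame["Date"] == date]
    return snapshot[snapshot["Confirmed"] > 0].reset_index(drop=True)

def plot_points(points, date):
    """Draw one bubble per location, sized by confirmed cases."""
    # Imported here so the data preparation above can run without Plotly
    import plotly.express as px
    return px.scatter_geo(
        points,
        lat="Lat",
        lon="Long",
        size="Confirmed",
        hover_name="Country/Region",
        projection="natural earth",
        title=f"Confirmed COVID-19 Cases on {date}",
    )

points = points_for_date(toy, "2020-07-21")
print(points)
```

Calling `plot_points(points, "2020-07-21").show()` would render the map. Because each row keeps its own coordinates, province-level rows appear as separate bubbles rather than being merged into their country.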

7. Regional Clustering `(15 pts)`

Create a visualization that shows how confirmed cases are distributed geographically within a single WHO Region.

In [155]:
target_region = 'Europe'
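df_region = df[(df["Date"] == '2020-07-21') & (df["WHO Region"] == target_region)]
# Combine province-level rows so each country appears once on the map
df_region = df_region.groupby('Country/Region')['Confirmed'].sum().reset_index()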

fig = px.choropleth(
    df_region,
    locations="Country/Region",
    locationmode="country names",
    color="Confirmed",
    hover_name="Country/Region",
    projection="natural earth",
    title=f"Confirmed Cases Distribution in {target_region} Region",
)

fig.show()
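
To back up the map with numbers, it may help to compute each country's share of the regional total. The sketch below uses toy data with made-up values and hypothetical province names: it filters to one date and region, sums province-level rows per country, and adds a `Share` column (an illustrative name, not part of the dataset).

```python
import pandas as pd

# Toy data: France is split into two rows; Brazil is outside the target region
toy = pd.DataFrame({
    "Province/State": [None, "Reunion", None, None],
    "Country/Region": ["France", "France", "Italy", "Brazil"],
    "WHO Region": ["Europe", "Europe", "Europe", "Americas"],
    "Date": ["2020-07-21"] * 4,
    "Confirmed": [60, 20, 20, 500],
})

region = toy[(toy["Date"] == "2020-07-21") & (toy["WHO Region"] == "Europe")]

# One row per country, largest first
by_country = (
    region.groupby("Country/Region")["Confirmed"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)

# Fraction of the region's confirmed cases attributed to each country
by_country["Share"] = by_country["Confirmed"] / by_country["Confirmed"].sum()

print(by_country)
```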