# Exercise 4

This exercise focuses on data visualization and interpretation using a real-world COVID-19 dataset. The dataset contains daily records of confirmed cases, deaths, recoveries, and active cases across countries and regions, along with temporal and geographic information.
The goal of this exercise is not only to create charts, but to choose appropriate visualizations, apply correct data aggregation, and draw meaningful insights from the data. You will work with time-based, categorical, numerical, and geographic variables, and you are expected to think critically about how design choices affect interpretation.

Your visualizations should follow good practices:
- Use clear titles, axis labels, and legends
- Choose chart types appropriate to the data and question
- Avoid misleading scales or cluttered designs
- Clearly explain patterns, trends, or anomalies you observe

Unless stated otherwise, you may filter, aggregate, or group the data as needed.

<img src="https://d3i6fh83elv35t.cloudfront.net/static/2020/03/Screen-Shot-2020-03-05-at-6.29.29-PM-1024x574.png"/>

In [90]:
import kagglehub
import os
import pandas as pd

# Download latest version
path = kagglehub.dataset_download("imdevskp/corona-virus-report")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'corona-virus-report' dataset.
Path to dataset files: /kaggle/input/corona-virus-report


In [91]:
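# Confirm that the download path is a directory before listing it
print(os.path.isdir(path))

contents = os.listdir(path)

# NOTE: os.listdir() returns entries in arbitrary order, so contents[0] is not
# guaranteed to be the same file on every machine. Check df.info() below to
# confirm the expected columns were loaded.
mydataset = os.path.join(path, contents[0])

df = pd.read_csv(mydataset)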

True
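
Relying on the first entry of a directory listing is fragile, because the listing order is not guaranteed. The following is a minimal sketch of a more defensive selection, assuming the wanted file is the one whose header contains the columns used later in this exercise. The helper name `find_csv_with_columns` and the demo file names are hypothetical and only serve the illustration.

```python
import os
import tempfile

import pandas as pd


def find_csv_with_columns(folder, required):
    """Return the first CSV (sorted by name) whose header has all required columns."""
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".csv"):
            continue
        candidate = os.path.join(folder, name)
        header = pd.read_csv(candidate, nrows=0).columns  # reads the header row only
        if set(required).issubset(header):
            return candidate
    raise FileNotFoundError(f"No CSV in {folder} has columns {sorted(required)}")


# Demonstration on a throwaway folder with two made-up files
demo_dir = tempfile.mkdtemp()
pd.DataFrame({"a": [1]}).to_csv(os.path.join(demo_dir, "a_other.csv"), index=False)
pd.DataFrame(
    {"Date": ["2020-01-22"], "Confirmed": [0], "WHO Region": ["Europe"]}
).to_csv(os.path.join(demo_dir, "b_cases.csv"), index=False)

found = find_csv_with_columns(demo_dir, {"Date", "Confirmed", "WHO Region"})
print(os.path.basename(found))
```

In the notebook, the same helper could be called with `path` and the column names reported by `df.info()`.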


In [92]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49068 entries, 0 to 49067
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Province/State  14664 non-null  object 
 1   Country/Region  49068 non-null  object 
 2   Lat             49068 non-null  float64
 3   Long            49068 non-null  float64
 4   Date            49068 non-null  object 
 5   Confirmed       49068 non-null  int64  
 6   Deaths          49068 non-null  int64  
 7   Recovered       49068 non-null  int64  
 8   Active          49068 non-null  int64  
 9   WHO Region      49068 non-null  object 
dtypes: float64(2), int64(4), object(4)
memory usage: 3.7+ MB


In [93]:
df.head()

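| | Province/State | Country/Region | Lat | Long | Date | Confirmed | Deaths | Recovered | Active | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | Afghanistan | 33.93911 | 67.709953 | 2020-01-22 | 0 | 0 | 0 | 0 | Eastern Mediterranean |
| 1 | NaN | Albania | 41.1533 | 20.1683 | 2020-01-22 | 0 | 0 | 0 | 0 | Europe |
| 2 | NaN | Algeria | 28.0339 | 1.6596 | 2020-01-22 | 0 | 0 | 0 | 0 | Africa |
| 3 | NaN | Andorra | 42.5063 | 1.5218 | 2020-01-22 | 0 | 0 | 0 | 0 | Europe |
| 4 | NaN | Angola | -11.2027 | 17.8739 | 2020-01-22 | 0 | 0 | 0 | 0 | Africa |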


In [94]:
df.query("`Country/Region` == 'Philippines'")

| | Province/State | Country/Region | Lat | Long | Date | Confirmed | Deaths | Recovered | Active | WHO Region |
|---|---|---|---|---|---|---|---|---|---|---|
| 180 | NaN | Philippines | 12.879721 | 121.774017 | 2020-01-22 | 0 | 0 | 0 | 0 | Western Pacific |
| 441 | NaN | Philippines | 12.879721 | 121.774017 | 2020-01-23 | 0 | 0 | 0 | 0 | Western Pacific |
| 702 | NaN | Philippines | 12.879721 | 121.774017 | 2020-01-24 | 0 | 0 | 0 | 0 | Western Pacific |
| 963 | NaN | Philippines | 12.879721 | 121.774017 | 2020-01-25 | 0 | 0 | 0 | 0 | Western Pacific |
| 1224 | NaN | Philippines | 12.879721 | 121.774017 | 2020-01-26 | 0 | 0 | 0 | 0 | Western Pacific |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 47943 | NaN | Philippines | 12.879721 | 121.774017 | 2020-07-23 | 74390 | 1871 | 24383 | 48136 | Western Pacific |
| 48204 | NaN | Philippines | 12.879721 | 121.774017 | 2020-07-24 | 76444 | 1879 | 24502 | 50063 | Western Pacific |
| 48465 | NaN | Philippines | 12.879721 | 121.774017 | 2020-07-25 | 78412 | 1897 | 25752 | 50763 | Western Pacific |
| 48726 | NaN | Philippines | 12.879721 | 121.774017 | 2020-07-26 | 80448 | 1932 | 26110 | 52406 | Western Pacific |


In [95]:
df["Date"] = pd.to_datetime(df["Date"])

latest_date = df["Date"].max()

## A. Time-Based Visualizations

1. Global Trend `(5 pts)`

Aggregate the data by Date and create a line chart showing the global number of confirmed COVID-19 cases over time.

In [96]:
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from bokeh.models import Legend
from bokeh.models.formatters import PrintfTickFormatter
import plotly.express as px


output_notebook()

In [97]:
global_cases = df.groupby("Date")["Confirmed"].sum().reset_index()

p = figure(
    title="Global Confirmed COVID-19 Cases Over Time (Log Scale)",
    x_axis_type="datetime",
    y_axis_type="log",
    x_axis_label="Date",
    y_axis_label="Confirmed Cases"
)

p.line(global_cases["Date"], global_cases["Confirmed"], line_width=2, color="navy")
p.yaxis.formatter = PrintfTickFormatter(format="%d")

show(p)
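
The `Confirmed` column is a running total, so the global line can only rise or flatten. As a rough illustration of the aggregation step, and of how daily new cases could be derived from it, here is a sketch on made-up data. The frame `toy` and the `New` column are assumptions for the example and are not part of the dataset.

```python
import pandas as pd

# Made-up cumulative counts for two countries over three days
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23", "2020-01-24", "2020-01-24"]
    ),
    "Country/Region": ["A", "B", "A", "B", "A", "B"],
    "Confirmed": [1, 0, 3, 2, 6, 2],
})

# Same aggregation as the global trend: one cumulative total per date
global_toy = toy.groupby("Date")["Confirmed"].sum().reset_index()

# Day-over-day difference; the first day has no predecessor, so keep its total
global_toy["New"] = (
    global_toy["Confirmed"].diff().fillna(global_toy["Confirmed"]).astype(int)
)

print(global_toy)
```

Plotting `New` instead of `Confirmed` would show when growth accelerated or slowed, which is harder to read from a cumulative curve.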


2. Country-Level Trends `(5 pts)`

Select three countries and visualize their confirmed case counts over time on the same plot.

In [98]:
countries = ["US", "India", "Brazil"]
colors = {"US": "red", "India": "green", "Brazil": "blue"}

country_df = df[df["Country/Region"].isin(countries)]
country_trends = country_df.groupby(["Date", "Country/Region"])["Confirmed"].sum().reset_index()

p = figure(
    title="Confirmed COVID-19 Cases Over Time by Country (Log Scale)",
    x_axis_type="datetime",
    y_axis_type="log",
    x_axis_label="Date",
    y_axis_label="Confirmed Cases"
)

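lines = []
for country in countries:
    subset = country_trends[country_trends["Country/Region"] == country]
    # Zero counts cannot be drawn on a log axis, so plot only positive values
    subset = subset[subset["Confirmed"] > 0]
    line = p.line(subset["Date"], subset["Confirmed"], line_width=2, color=colors[country])
    lines.append((country, [line]))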

p.add_layout(Legend(items=lines))
p.yaxis.formatter = PrintfTickFormatter(format="%d")
show(p)
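
A log axis has no position for zero, which matters here because many countries report zero cases in the first weeks of the data. The sketch below, using made-up counts, shows why such values need to be dropped or masked before plotting on a log scale.

```python
import numpy as np
import pandas as pd

# Made-up cumulative counts that start at zero
counts = pd.Series([0, 0, 1, 10, 100], name="Confirmed")

# log10(0) is negative infinity, so those points have no place on a log axis
with np.errstate(divide="ignore"):
    logs = np.log10(counts.astype(float))

# Keep only the values that a log axis can represent
positive = counts[counts > 0]

print(logs.tolist())
print(positive.tolist())
```

An alternative is to keep a linear axis, which shows absolute differences between countries more directly but compresses the early part of each curve.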


3. Active vs Recovered `(5 pts)`

For a selected country, create a line chart showing Active and Recovered cases over time.

In [106]:
country = "Afghanistan"

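country_df = df[df["Country/Region"] == country]
trend = country_df.groupby("Date")[["Active", "Recovered"]].sum().reset_index()

# Zero counts cannot be drawn on a log axis; mask them so each line starts at its first positive value
cols = ["Active", "Recovered"]
trend[cols] = trend[cols].where(trend[cols] > 0)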

p = figure(
    title=f"Active vs Recovered COVID-19 Cases Over Time in {country} (Log Scale)",
    x_axis_type="datetime",
    y_axis_type="log",
    x_axis_label="Date",
    y_axis_label="Number of Cases"
)

p.line(trend["Date"], trend["Active"], line_width=2, color="orange", legend_label="Active")
p.line(trend["Date"], trend["Recovered"], line_width=2, color="green", legend_label="Recovered")

p.legend.location = "top_left"
p.yaxis.formatter = PrintfTickFormatter(format="%d")
show(p)
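
When interpreting this chart, it helps to know how `Active` relates to the other columns. In the Philippines rows printed earlier, `Active` equals `Confirmed` minus `Deaths` minus `Recovered`. The sketch below checks that relationship on those four rows only; it is an observation from the sample, not a documented guarantee for the whole dataset.

```python
import pandas as pd

# Values copied from the Philippines output shown earlier in this notebook
rows = pd.DataFrame({
    "Date": pd.to_datetime(["2020-07-23", "2020-07-24", "2020-07-25", "2020-07-26"]),
    "Confirmed": [74390, 76444, 78412, 80448],
    "Deaths": [1871, 1879, 1897, 1932],
    "Recovered": [24383, 24502, 25752, 26110],
    "Active": [48136, 50063, 50763, 52406],
})

# Recompute Active from the other columns and compare with the reported value
derived = rows["Confirmed"] - rows["Deaths"] - rows["Recovered"]
matches = bool((derived == rows["Active"]).all())

print(matches)
```

If the relationship holds for the selected country, a drop in `Active` reflects recoveries and deaths outpacing new confirmations rather than a fall in cumulative cases.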


## B. Comparative Visualizations

4. Country Comparison `(5 pts)`

Using data from a single date, create a bar chart showing the top 10 countries by confirmed cases.

In [100]:
selected_date = df["Date"].max()

date_df = df[df["Date"] == selected_date]
country_totals = date_df.groupby("Country/Region")["Confirmed"].sum().reset_index()
top_10 = country_totals.sort_values(by="Confirmed", ascending=False).head(10)

countries = top_10["Country/Region"].tolist()
cases = top_10["Confirmed"].tolist()

p = figure(
    x_range=countries,
    title=f"Top 10 Countries by Confirmed COVID-19 Cases on {selected_date.date()}",
    x_axis_label="Country",
    y_axis_label="Confirmed Cases"
)

p.vbar(x=countries, top=cases, width=0.8, fill_color="teal")
p.xaxis.major_label_orientation = 1.0
p.yaxis.formatter = PrintfTickFormatter(format="%d")
show(p)


5. WHO Region Comparison `(5 pts)`

Aggregate confirmed cases by WHO Region and visualize the result using a bar chart.

In [101]:
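# Confirmed is cumulative, so summing over every date would count each case many times.
# Use only the latest date to get the current total per region.
latest_df = df[df["Date"] == latest_date]
region_totals = latest_df.groupby("WHO Region")["Confirmed"].sum().reset_index()

region_totals = region_totals.sort_values(by="Confirmed", ascending=False)

regions = region_totals["WHO Region"].tolist()
cases = region_totals["Confirmed"].tolist()

p = figure(
    x_range=regions,
    title=f"Confirmed COVID-19 Cases by WHO Region on {latest_date.date()}",
    x_axis_label="WHO Region",
    y_axis_label="Confirmed Cases"
)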

p.vbar(x=regions, top=cases, width=0.8, fill_color="orange")
p.xaxis.major_label_orientation = 1.0
p.yaxis.formatter = PrintfTickFormatter(format="%d")

show(p)
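
Because `Confirmed` is a cumulative count, the choice of rows to aggregate changes the result substantially. The following sketch on made-up numbers contrasts summing across all dates with summing a single date, assuming two regions and two days.

```python
import pandas as pd

# Made-up cumulative counts for two regions on two consecutive days
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-07-25", "2020-07-25", "2020-07-26", "2020-07-26"]),
    "WHO Region": ["Europe", "Africa", "Europe", "Africa"],
    "Confirmed": [100, 40, 110, 50],
})

# Summing every row adds each day's running total on top of the previous one
summed_all_dates = toy.groupby("WHO Region")["Confirmed"].sum()

# Restricting to the latest date gives the actual cumulative total per region
latest = toy[toy["Date"] == toy["Date"].max()]
latest_totals = latest.groupby("WHO Region")["Confirmed"].sum()

print(summed_all_dates.to_dict())
print(latest_totals.to_dict())
```

The all-dates sum grows with the number of days in the data, so it overstates every region and can distort the comparison between regions whose outbreaks started at different times.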

## C. Geographic Visualization

6. Geographic Spread `(10 pts)`

Using Latitude and Longitude, create a map-based visualization showing confirmed cases for a selected date.

In [102]:
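selected_date = df["Date"].max()

map_df = df[df["Date"] == selected_date]

map_df = map_df.dropna(subset=["Lat", "Long", "Confirmed"])

# Some countries are split into several Province/State rows; sum them so that
# each country appears once with its full total in the choropleth
map_df = map_df.groupby("Country/Region", as_index=False)["Confirmed"].sum()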


In [103]:
fig = px.choropleth(
    map_df,
    locations="Country/Region",
    locationmode="country names",
    color="Confirmed",
    hover_name="Country/Region",
    color_continuous_scale="Viridis",
    range_color=[map_df["Confirmed"].min(), map_df["Confirmed"].max()],
    title=f"Global COVID-19 Confirmed Cases on {selected_date.date()}",
    projection="natural earth"
)

fig.show()
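
The task mentions Latitude and Longitude, which a point-based map uses directly. Below is a minimal sketch of the data preparation for such a map, on made-up rows; the `Label` column is a hypothetical helper for hover text and does not exist in the dataset.

```python
import pandas as pd

# Made-up rows: one country without provinces, one country with two provinces
toy = pd.DataFrame({
    "Province/State": [None, "P1", "P2"],
    "Country/Region": ["X", "Y", "Y"],
    "Lat": [10.0, 20.0, 25.0],
    "Long": [30.0, 40.0, 45.0],
    "Date": pd.to_datetime(["2020-07-26"] * 3),
    "Confirmed": [5, 0, 7],
})

# Keep one date and drop locations with no cases, which would be invisible markers
is_latest = toy["Date"] == toy["Date"].max()
points = toy[is_latest & (toy["Confirmed"] > 0)].copy()

# Prefer the province name for hover text and fall back to the country name
points["Label"] = points["Province/State"].fillna(points["Country/Region"])

print(points[["Label", "Lat", "Long", "Confirmed"]])
```

A frame shaped like `points` could then be passed to a point map, for example `px.scatter_geo(points, lat="Lat", lon="Long", size="Confirmed", hover_name="Label")`. Unlike the choropleth, this keeps province-level detail and does not depend on country-name matching.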

7. Regional Clustering `(15 pts)`

Create a visualization that shows how confirmed cases are distributed geographically within a single WHO Region.

In [104]:
selected_date = df["Date"].max()
map_df = df[df["Date"] == selected_date].dropna(subset=["Confirmed", "WHO Region"])

# WHO Regions are administrative groupings, not continents, so name the totals accordingly
region_totals = map_df.groupby("WHO Region")["Confirmed"].sum().reset_index()

map_df = map_df.merge(region_totals, on="WHO Region", suffixes=("", "_RegionTotal"))


In [105]:
fig = px.choropleth(
    map_df,
    locations="Country/Region",
    locationmode="country names",
    color="Confirmed_RegionTotal",
    hover_name="Country/Region",
    hover_data={"Confirmed": True, "WHO Region": True, "Confirmed_RegionTotal": True},
    color_continuous_scale="Viridis",
    title=f"COVID-19 Confirmed Cases by WHO Region on {selected_date.date()}",
    projection="natural earth"
)

fig.show()
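
The task asks for the distribution within a single WHO Region, which calls for filtering to one region before plotting. One possible preparation step is sketched below on made-up rows, assuming "Europe" as the selected region; the `Share` column is a hypothetical addition for the example.

```python
import pandas as pd

# Made-up single-date rows; country A is split into two province rows
toy = pd.DataFrame({
    "Country/Region": ["A", "A", "B", "C"],
    "WHO Region": ["Europe", "Europe", "Europe", "Africa"],
    "Confirmed": [30, 20, 50, 99],
})

region = "Europe"  # assumed choice for the illustration

# Keep only the selected region, then sum province rows into one row per country
in_region = toy[toy["WHO Region"] == region]
by_country = in_region.groupby("Country/Region", as_index=False)["Confirmed"].sum()

# Each country's share of the regional total makes clustering easier to describe
by_country["Share"] = by_country["Confirmed"] / by_country["Confirmed"].sum()

print(by_country)
```

Coloring a map by `Confirmed` from a frame like `by_country` would show variation between countries inside the region, whereas coloring by the regional total gives every country in the region the same shade.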
