In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.style.use('seaborn-v0_8-whitegrid')

# Wrangle and Visualize Global COVID-19 Deaths 

This notebook contains a solution proposal to BAN405 mandatory assignment #2.

## Task 1: Data wrangling

First, we must import the data and wrangle it into a format that is suitable for visualizing the trends in COVID deaths in different countries over time.

**1. Load the COVID data set and explore the data.**

In [None]:
# Import covid data from url
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/refs/heads/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
df_covid = pd.read_csv(url)

# Drop lat and long (will not be used)
df_covid.drop(['Lat', 'Long'], axis = 1, inplace = True)

print(len(df_covid))
df_covid.head()

In [None]:
# Check data types of first 10 columns
# (location identifiers are strings and date columns are integers)
df_covid.iloc[:, :10].info()

In [None]:
# Check how many unique countries/regions
df_covid['Country/Region'].nunique()

In [None]:
# Check which unique values in country/region
df_covid['Country/Region'].unique()

In [None]:
# Check number of missings
# (appears that only province/state contains missings)
df_covid.isna().sum()

In [None]:
# Check rows with non-missing values in state/province
df_covid[df_covid['Province/State'].notna()]

In [None]:
# Check which countries have multiple state/provinces
df_covid[df_covid['Province/State'].notna()]['Country/Region'].unique()

In [None]:
# Check how many state/province these countries have
df_covid[df_covid['Province/State'].notna()].groupby('Country/Region').size()

In [None]:
# Check how many countries don't have multiple state/provinces
df_covid[df_covid['Province/State'].isna()]['Country/Region'].nunique()

**2. Reshape the data from wide to long so that dates are in a single column (i.e., tidy format).**

To have the dates in a single column instead of as column labels, we must reshape the data from wide to long using the pandas method `melt`. However, because some countries have multiple states/provinces, we must use both `Province/State` and `Country/Region` as identifier variables. We also assign the new columns in the DataFrame the labels `Date` and `Total_deaths`.
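As a minimal sketch of what `melt` does, the toy example below reshapes a small wide table with the same layout as the COVID file. The DataFrame `wide_toy` and its numbers are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy wide table (made-up numbers): one column per date
wide_toy = pd.DataFrame({
    'Province/State': [None, 'Hubei'],
    'Country/Region': ['Norway', 'China'],
    '1/22/20': [0, 17],
    '1/23/20': [0, 18],
})

# Reshape to long format: one row per location per date
long_toy = wide_toy.melt(
    id_vars = ['Province/State', 'Country/Region'],
    var_name = 'Date',
    value_name = 'Total_deaths'
)

print(long_toy)
```

The two date columns become two rows per location, so the long table has 2 x 2 = 4 rows.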

In [None]:
df_covid = df_covid.melt(
    id_vars = ['Province/State', 'Country/Region'],
    var_name = 'Date', # Label for the column containing the previous column labels
    value_name = 'Total_deaths' # Label for the column containing the values in the previous columns
)

print(len(df_covid))
df_covid.head()

In [None]:
# Check dtypes in long df
df_covid.info()

In [None]:
# Check number of missings
df_covid.isna().sum()

**3. Convert dates to timestamps with the correct date format.**

Because the dates are objects (i.e., strings), we must change the data type to timestamps using the pandas method `to_datetime`. However, note that this method has a preference for dates that are in the format "YYYY-MM-DD", and it may therefore not infer the format correctly if the dates are in a different format. 

In our case, we must use the `format` parameter to specify that the dates are in the format "m/d/yy". See [strftime documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) for more information on choices in the format specifier.
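To see why the explicit format matters, consider the small sketch below. The date strings in `dates_toy` are illustrative examples in the same "m/d/yy" style as the column labels. The same string can be read as either March 12 or December 3 depending on the format specifier.

```python
import pandas as pd

# Example date strings in "m/d/yy" style (illustrative values)
dates_toy = pd.Series(['1/22/20', '12/3/21', '3/9/23'])

# Month first, then day, then two-digit year
parsed = pd.to_datetime(dates_toy, format = '%m/%d/%y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())

# The same string read as day first gives a different date
day_first = pd.to_datetime(pd.Series(['12/3/21']), format = '%d/%m/%y')
print(day_first.dt.strftime('%Y-%m-%d').tolist())
```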

In [None]:
df_covid['Date'] = pd.to_datetime(df_covid['Date'], format = '%m/%d/%y')

df_covid.head()

In [None]:
# Check that dates are now timestamps
df_covid.info()

In [None]:
# Check min and max dates (data starts on Jan 22, 2020 and ends on March 9, 2023)
print(df_covid['Date'].min())
print(df_covid['Date'].max())

**4. Aggregate the data to the country-level.** 

As some countries have multiple provinces or states, we must sum the deaths across provinces/states in each country on each day. This gives us the cumulative number of deaths at the country level.

We can do this with `groupby`, grouping the data by country and day, and then applying the aggregation method `sum` to add up the deaths across provinces or states in each country. However, we first drop the state/province column so that this operation does not also "sum" the strings in that column for countries with multiple observations.
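The aggregation step can be sketched on a toy table as follows. The country names "A" and "B" and all numbers are made up. Selecting the `Total_deaths` column before calling `sum` is a variation on the approach used in this notebook, not a requirement.

```python
import pandas as pd

# Toy long table (made-up values): country A has two provinces
agg_toy = pd.DataFrame({
    'Province/State': ['X', 'Y', None],
    'Country/Region': ['A', 'A', 'B'],
    'Date': pd.to_datetime(['2020-01-22'] * 3),
    'Total_deaths': [1, 2, 5],
})

# Drop the province column, then sum deaths per country per day
country_toy = (
    agg_toy
    .drop(columns = 'Province/State')
    .groupby(['Country/Region', 'Date'], as_index = False)['Total_deaths']
    .sum()
)

print(country_toy)
```

Country A ends up with one row per day whose value is the sum of its provinces.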

In [None]:
# Drop state/province
df_covid.drop('Province/State', axis = 1, inplace = True)

# Sum number of deaths for each country on each day 
# (use reset index to return the country and day column to the df)
df_covid = df_covid.groupby(['Country/Region', 'Date']).sum().reset_index()

print(len(df_covid))
df_covid.head()

In [None]:
# Check number of unique observations (i.e., days) for each country
# (all countries have the same number of days (1143) now)
df_covid.groupby('Country/Region').size()#.value_counts()

**5. Create a new column that contains the daily number of new deaths in each country on each day.**

Since the number of deaths is a cumulative sum over time, we can calculate the number of new deaths on each day by subtracting the total observed on the previous day from the total on the current day. We can do this for each country by using `groupby` and then applying the `diff` method to the column with total deaths.

However, note that we must first ensure that observations for each country are sorted according to dates.
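The following toy example illustrates how `diff` behaves within groups. The countries "A" and "B" and their numbers are made up. Note that the first observation in each group has no previous day and therefore becomes missing.

```python
import pandas as pd

# Toy cumulative series for two made-up countries
diff_toy = pd.DataFrame({
    'Country/Region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2020-01-22', '2020-01-23', '2020-01-24'] * 2),
    'Total_deaths': [0, 2, 5, 10, 10, 13],
})

# Sort within country so that diff compares consecutive days
diff_toy = diff_toy.sort_values(['Country/Region', 'Date'])

# Day-over-day change, computed separately for each country
diff_toy['New_deaths'] = diff_toy.groupby('Country/Region')['Total_deaths'].diff()

print(diff_toy)
```

One consequence worth keeping in mind: because the first day is missing, summing `New_deaths` for country B gives 3, not its final total of 13. Deaths that were already recorded on the first day are not counted as new deaths.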

In [None]:
# Sort rows according to dates for each country
# (note that data was already sorted correctly, but best to be on the safe side)
df_covid.sort_values(['Country/Region', 'Date'], inplace = True)

# Calculate number of daily new deaths for each country
df_covid['New_deaths'] = df_covid.groupby('Country/Region')['Total_deaths'].diff()

df_covid.head()

In [None]:
# Check rows with missing value on new deaths
df_covid[df_covid['New_deaths'].isna()]

In [None]:
# Check which dates are missing value on new deaths
# (see that this is only the case for the first observed date)
df_covid[df_covid['New_deaths'].isna()]['Date'].unique()

## Task 2: Data visualization

Now that we have a tidy DataFrame with COVID deaths (both total and new deaths), we can use the data to visualize deaths for different countries. We produce the following visualizations:

**1a) A single graph that contains line plots of total deaths over time for the three countries with the highest total number of COVID-19 deaths in the data.**


First, we have to identify which countries in our data experienced the highest number of total COVID-19 deaths. We can extract this information in several different ways, e.g., by summing the number of new deaths for each country, or by simply checking the total number of deaths on the last day in the sample for each country.

In [None]:
# Alt. 1: sum new deaths by country and check top values
df_covid.groupby('Country/Region')['New_deaths'].sum().sort_values(ascending=False)

In [None]:
# Alt. 2: extract last observed value on total deaths for each country
df_covid.groupby('Country/Region')['Total_deaths'].last().sort_values(ascending=False)

In [None]:
# Alt. 3: filter on last date and sort filtered df according to total deaths
df_last = df_covid[df_covid['Date'] == df_covid['Date'].max()].copy()
df_last.sort_values('Total_deaths', ascending = False, inplace = True)

df_last.head()

In [None]:
# Extract top three countries by indexing df
countries = df_last['Country/Region'][:3].values

countries

In [None]:
# Alternatively, use the nlargest method
countries = df_last.nlargest(3, 'Total_deaths')['Country/Region'].values

countries

Each of the methods above shows that the three countries with the highest number of deaths were the US, Brazil and India. We now create a graph with line plots showing the accumulated number of deaths during the pandemic for each of the three countries.

In [None]:
fig, ax = plt.subplots(figsize = (8, 4))

for country in countries:
    # Extract observations for country (sort on dates)
    subset = df_covid[df_covid['Country/Region'] == country].sort_values('Date')
    
    # Create line plot of total deaths for country 
    ax.plot(
        subset['Date'],
        subset['Total_deaths'] / 1000, # scale to thousands to avoid large numbers on yaxis 
        label = country
    )
    # Label the line with the final number of deaths
    last_date = subset['Date'].iloc[-1]
    last_value = subset['Total_deaths'].iloc[-1] / 1000    
    ax.text(
        last_date,                         # location for text on xaxis
        last_value,                        # location for text on yaxis
        f'{last_value:.0f}k',              # format how number is displayed
        #color = ax.lines[-1].get_color()  # uncomment this to fix text color
    )

# Formatting
ax.set_xlim(df_covid['Date'].min())
ax.set_title('Top 3 Countries with the Highest Total COVID-19 Deaths')
ax.set_ylabel('Cumulative deaths (thousands)')
ax.legend(title = 'Country')

plt.savefig('plots/total_deaths.png', dpi = 500, bbox_inches = 'tight')

**1b) A figure with three subplots that show the daily number of new deaths for Norway, Denmark and Sweden.**

In [None]:
fig, ax = plt.subplots(
    nrows = 3,        # 3x1 subplots
    ncols = 1, 
    figsize = (11, 9), 
    sharex = True # set this to False (or comment out) to add xaxis labels to all subplots
)

countries = ['Norway', 'Sweden', 'Denmark']

for i, country in enumerate(countries):
    # Extract observations for country (sort on dates)
    subset = df_covid[df_covid['Country/Region'] == country].sort_values('Date')

    # Create line plot of daily new deaths for country (in subplot)
    ax[i].plot(
        subset['Date'],
        subset['New_deaths'],
    )

    # Formatting of subplot
    ax[i].set_title(f'Daily new deaths in {country}')
    ax[i].set_ylabel('Daily deaths')
    ax[i].set_xlim(df_covid['Date'].min(), df_covid['Date'].max())

# Add title to figure
fig.suptitle('Daily New COVID-19 Deaths: Norway, Denmark, Sweden', fontsize = 14, weight = 'bold')
plt.tight_layout()  # makes suptitle space look better

plt.savefig('plots/new_deaths.png', dpi = 500, bbox_inches = 'tight')

**2. A reusable plotting function that displays a single line graph that contains the total number of deaths over time for one or more countries.**

For this task we can simply modify the code from task 1a by converting it into a function called `plot_total_deaths`. The function takes two inputs: a list of countries (`countries`) and a DataFrame (`data`) with the COVID-19 deaths. It displays a single graph with line plots of the total number of deaths for the requested countries. By default, the function uses the data stored in `df_covid`.

The assignment states that the function should be able to handle calls in which the list of countries includes locations not present in the data (e.g., Atlantis), as well as the case where the list of countries is empty. There are several ways to deal with these scenarios. The solution chosen here is to use `if` statements to check that the list of countries is not empty and that at least one of the requested countries is present in the data. If either check fails, we use the `return` statement to exit the function. Otherwise, we go ahead and create a line plot for the requested countries that are found in the data.

Note that the function could be further improved by checking that, e.g., the DataFrame does in fact contain the necessary columns (`Country/Region`, `Total_deaths`, `Date`), the dates are timestamps, and the selected countries are passed as strings inside a list or a list-like object (e.g., a tuple).
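One possible way to add such checks is sketched below. The helper `check_plot_inputs` and the constant `REQUIRED_COLUMNS` are hypothetical names introduced here for illustration and are not part of the assignment solution. The sketch raises exceptions rather than printing messages, which is a design choice and not the only option.

```python
import pandas as pd

# Columns that the plotting function relies on (hypothetical constant)
REQUIRED_COLUMNS = ['Country/Region', 'Date', 'Total_deaths']

def check_plot_inputs(countries, data):
    """Hypothetical helper: return country names as a list, or raise an error."""
    # A bare string is iterable, so wrap it to avoid looping over characters
    if isinstance(countries, str):
        countries = [countries]
    countries = list(countries)

    if not all(isinstance(c, str) for c in countries):
        raise TypeError('All country names must be strings.')

    missing = [col for col in REQUIRED_COLUMNS if col not in data.columns]
    if missing:
        raise KeyError(f'Data is missing required columns: {missing}')

    if not pd.api.types.is_datetime64_any_dtype(data['Date']):
        raise TypeError('The Date column must contain timestamps.')

    return countries

# Usage on a toy DataFrame (made-up values)
check_toy = pd.DataFrame({
    'Country/Region': ['Norway'],
    'Date': pd.to_datetime(['2020-01-22']),
    'Total_deaths': [0],
})
print(check_plot_inputs(('Norway', 'Sweden'), check_toy))
```

A function such as `plot_total_deaths` could call a helper like this at the top and then continue with the cleaned list of names.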

In [None]:
def plot_total_deaths(countries, data = df_covid):

    # Check that at least one country has been selected
    if len(countries) == 0:
        print('No countries provided. Please supply at least one country name.')
        return # use return statement to exit the function if no countries have been selected

    # Filter data for selection of countries and check that data is not empty
    df_temp = data[data['Country/Region'].isin(countries)]
    if len(df_temp) == 0:
        print('None of the requested countries were found in the dataset.')
        return # break function call if data does not contain any of the countries

    # Create a line plot for countries present in filtered data 
    fig, ax = plt.subplots(figsize = (8, 4))
    
    for country in df_temp['Country/Region'].unique():
        # Extract observations for country (sort on dates)
        subset = df_temp[df_temp['Country/Region'] == country].sort_values('Date')
        
        # Create line plot of total deaths for country 
        ax.plot(
            subset['Date'],
            subset['Total_deaths'],
            label = country
        )
    
    # Formatting
    ax.set_xlim(df_temp['Date'].min(), df_temp['Date'].max())
    ax.set_title('Cumulative Number of COVID-19 Deaths')
    ax.legend()
    
    # Extra: instead of scaling, add thousands separator to numbers on yaxis
    # (method from matplotlib.ticker submodule)
    ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
    
    plt.show()

In [None]:
plot_total_deaths(['Norway', 'Denmark', 'Sweden'])

In [None]:
plot_total_deaths(['Norway', 'Denmark', 'Atlantis'])

In [None]:
plot_total_deaths(['The moon', 'Mars', 'Atlantis'])

In [None]:
plot_total_deaths([])

## Task 3: Data merging

Finally, we want to compare COVID-19 deaths across different parts of the world. To do this, we need to merge our COVID data with a new data set ("Countries Continents.csv") that contains data on which continent each country belongs to. The data set is from Our World in Data and it can be imported directly from their GitHub repo [here](https://raw.githubusercontent.com/owid/owid-datasets/refs/heads/master/datasets/Countries%20Continents/Countries%20Continents.csv).

**1. Load and explore the new data set.**

In [None]:
url = 'https://raw.githubusercontent.com/owid/owid-datasets/refs/heads/master/datasets/Countries%20Continents/Countries%20Continents.csv'
df_owid = pd.read_csv(url)

print(len(df_owid))
df_owid.head()

In [None]:
# Check data types
df_owid.info()

In [None]:
# Check number of missings
df_owid.isna().sum()

In [None]:
# Check number of unique countries and the unique values
print(df_owid['Entity'].nunique())
print(df_owid['Entity'].unique())

In [None]:
# Check unique continents
df_owid['Countries Continents'].unique()

In [None]:
# Check number of countries per continent
df_owid.groupby('Countries Continents').size()

**2. Merge COVID data with continent information**

Before we can merge the two data sets, we need to ensure that the common columns that we will merge on have the same column labels. We also need to drop the `Year` column from the OWID data set as we do not want this information to appear in our merged data.

In [None]:
df_owid.drop('Year', axis = 1, inplace = True)
df_owid.columns = ['Country', 'Continent']

df_owid.head()

In [None]:
# Use rename to shorten the country column in the COVID data as well
df_covid.rename(columns = {'Country/Region' : 'Country'}, inplace = True)

df_covid.head()

According to the assignment, there are some slight name variations between the two data sets. To explore this, we perform a *left* join to see how many of the countries in the COVID data we are able to identify in the OWID data.
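As a small sketch of this kind of check, the toy example below uses the `indicator` parameter of `merge`, which adds a `_merge` column showing where each row was found. The two toy DataFrames are made up, and using `indicator` is an alternative to counting missing values in the continent column rather than the approach taken in this notebook.

```python
import pandas as pd

# Toy versions of the two data sets (illustrative rows only)
covid_toy = pd.DataFrame({'Country': ['Norway', 'US', 'Burma']})
owid_toy = pd.DataFrame({
    'Country': ['Norway', 'United States', 'Myanmar'],
    'Continent': ['Europe', 'North America', 'Asia'],
})

# Left join keeps all rows from the COVID side
check = covid_toy.merge(owid_toy, on = 'Country', how = 'left', indicator = True)

# Rows marked "left_only" had no match in the OWID data
unmatched = check.loc[check['_merge'] == 'left_only', 'Country'].tolist()
print(unmatched)
```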

In [None]:
df_test = df_covid.merge(df_owid, on = 'Country', how = 'left')

df_test.head()

In [None]:
# Check how many missing values in new continent column
df_test.isna().sum()

In [None]:
# Check which countries have missing continent information
df_test[df_test['Continent'].isna()]['Country'].unique()

From our exploration, we see that some of the missing values on continent information are due to the observation not being an actual country (e.g., "Winter Olympics 2022"), and we can safely ignore these missing values. 

However, other missing values (e.g., for "US") are most likely due to name variations between the two data sets. From inspecting the OWID data, we find that these countries are labeled as follows:

- "Burma" is "Myanmar"
- "Korea, South" is "South Korea"
- "Korea, North" is "North Korea"
- "US" is "United States"
- "Taiwan*" is "Taiwan"
- "West Bank and Gaza" is "Palestine"
- "Congo (Brazzaville)" and "Congo (Kinshasa)" are "Congo"

To improve the merge quality between our two data sets, we will update the country names in the COVID data for these countries (we ignore the rest for now).

First, we define a dictionary in which the keys are the old names and the values are the new names. We can then use the `loc` attribute to update the country names in the COVID data with the new names. 

In [None]:
country_d = {
    'Burma' : 'Myanmar',
    'US': 'United States',
    'West Bank and Gaza' : 'Palestine',
    'Korea, North' : 'North Korea',
    'Korea, South' : 'South Korea'
}

# Update all the old country names defined in the dictionary
for key in country_d:
    df_covid.loc[df_covid['Country'] == key, 'Country'] = country_d[key]

In [None]:
# Alternatively, we could have used the pandas method "replace", which
# replaces values in a column based on a mapping from a dictionary
# df_covid['Country'] = df_covid['Country'].replace(country_d)

For the remaining countries, we can use string methods to remove unwanted characters (*) at the end of the name and to drop everything in parentheses.
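The effect of these string methods can be sketched on a few example names. The Series `names_toy` below is a small illustrative sample, not the full list of countries.

```python
import pandas as pd

# A few example names with the patterns we want to clean
names_toy = pd.Series(['Taiwan*', 'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Norway'])

cleaned = (
    names_toy
    .str.rstrip('*')         # remove trailing asterisks
    .str.partition('(')[0]   # keep the text before the first "("
    .str.strip()             # remove leftover whitespace
)

print(cleaned.tolist())
```

Names without these patterns (such as "Norway") pass through unchanged, while the two Congo entries end up with the same name.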

In [None]:
# Remove "*" at the end of the name so that Taiwan* becomes Taiwan
df_covid['Country'] = df_covid['Country'].str.rstrip('*')

# Split on "(" and keep only first part so that "Congo (Brazzaville)" becomes Congo
df_covid['Country'] = df_covid['Country'].str.partition('(')[0].str.strip()

However, because we now have two observations for Congo on each day, we must sum the data across countries for each day.

In [None]:
# df_covid.groupby(['Country', 'Date']).size().sort_values(ascending = False)

In [None]:
df_covid = df_covid.groupby(['Country', 'Date']).sum().reset_index()

We can now re-merge the two data sets. However, note that we still use a left join so that we can check that we are now able to merge more countries than before.

In [None]:
df_merge = df_covid.merge(df_owid, on = 'Country', how = 'left')
df_merge.head()

In [None]:
# Check which countries have missing continent info
df_merge[df_merge['Continent'].isna()]['Country'].unique()

In [None]:
# Check that we have the same number of daily observations for all countries
df_merge.groupby('Country').size()

In [None]:
# Potentially: drop all observations with missing continent info
# df_merge = df_merge.dropna(subset = ['Continent'])

**3. Calculate the total number of COVID deaths per continent**

By summing the number of deaths by continent, we see that Europe had the highest number of COVID deaths in the data.

In [None]:
# Sum number of daily deaths by continent
df_merge.groupby(['Continent'])['New_deaths'].sum().sort_values().reset_index()

In [None]:
# Alternatively, sum number of total deaths by continent using only the last day
# df_merge[df_merge['Date'] == df_merge['Date'].max()].groupby('Continent')['Total_deaths'].sum()

**4. Create a bar plot of total number of COVID deaths per continent.**

In [None]:
# Create pandas series with continent totals
totals_series = df_merge.groupby(['Continent'])['New_deaths'].sum().sort_values()
totals_series

In [None]:
with plt.style.context('ggplot'):
    
    fig, ax = plt.subplots(figsize = (9, 4))

    # Bar plot of total deaths per continent
    ax.bar(totals_series.index, totals_series.values, edgecolor = 'black')

    # Formatting
    ax.set_title('Total COVID-19 Deaths by Continent', fontsize = 14, weight = 'bold')
    ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
    
    plt.savefig('plots/continent_bar_plot.png', dpi = 500, bbox_inches = 'tight')

In [None]:
# Note: can create bar plot using pandas plotting method
totals_series.plot(
    kind = 'bar', 
    figsize = (9, 4), 
    xlabel = '',
    title = 'Total COVID-19 Deaths by Continent',
    #rot = 0 # uncomment to have xlabels horizontal
)

plt.show()

**5. Create a stacked bar chart of total number of COVID deaths per continent per year.**

To create the stacked bar plot, we must first calculate the number of COVID deaths for each continent in each year. To group the data by year, we use the `dt` accessor to create a new column that indicates the year of the observation (note that alternatively we could have used the `resample` method).
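The sketch below compares the `dt.year` approach with a time-based grouping on a toy table with made-up numbers. Instead of `resample`, it uses `pd.Grouper` with a year-start frequency (`'YS'`), which is a closely related tool that combines more easily with a second grouping column. This substitution is an assumption made for the example.

```python
import pandas as pd

# Toy daily data (made-up values) spanning two calendar years
year_toy = pd.DataFrame({
    'Continent': ['Asia', 'Asia', 'Asia', 'Europe'],
    'Date': pd.to_datetime(['2020-03-01', '2020-12-31', '2021-01-01', '2021-06-01']),
    'New_deaths': [1, 2, 4, 8],
})

# Approach 1: extract the year with the dt accessor and group on it
by_year = (
    year_toy
    .assign(Year = year_toy['Date'].dt.year)
    .groupby(['Continent', 'Year'])['New_deaths']
    .sum()
)

# Approach 2: group on yearly bins of the Date column
by_grouper = (
    year_toy
    .groupby(['Continent', pd.Grouper(key = 'Date', freq = 'YS')])['New_deaths']
    .sum()
)

print(by_year)
print(by_grouper)
```

Both approaches give the same sums. The difference is that the second labels each group with a timestamp for the start of the year rather than an integer year.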

In [None]:
# Create copy of merged data and add year column
df_year_totals = df_merge.copy()
df_year_totals['Year'] = df_year_totals['Date'].dt.year

# Calculate sum of new deaths for each continent in each year
df_year_totals = df_year_totals.groupby(['Continent', 'Year'])['New_deaths'].sum().reset_index()

df_year_totals.head()

Although it is possible to create a stacked bar chart with matplotlib, this is one of the (few) cases where it is actually easier to use pandas plotting methods. Specifically, we use the `plot` method from pandas and set `kind='bar'`. 

By default, the pandas method `plot` uses the index as the values on the x-axis and the values in the columns on the y-axis. Therefore, we first need to use `pivot` to reshape the data so that the continents form the index and each column shows the totals for a specific year.
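The reshaping step can be illustrated on a toy table. The numbers in `pivot_toy` are made up and only serve to show how the rows are rearranged.

```python
import pandas as pd

# Toy long table (made-up values): one row per continent per year
pivot_toy = pd.DataFrame({
    'Continent': ['Asia', 'Asia', 'Europe', 'Europe'],
    'Year': [2020, 2021, 2020, 2021],
    'New_deaths': [10, 20, 30, 40],
})

# One row per continent, one column per year
wide_pivot = pivot_toy.pivot(
    index = 'Continent',
    columns = 'Year',
    values = 'New_deaths'
)

print(wide_pivot)
```

Each (continent, year) pair must appear only once in the long table. Otherwise `pivot` raises an error, and an aggregating alternative such as `pivot_table` would be needed.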

In [None]:
df_pivot = df_year_totals.pivot(
    columns = 'Year',     # Create a new column for each value in "Year" column
    index = 'Continent',  # Use values in "Continent" column as the new index
    values = 'New_deaths' # Populate new columns with the values in "New_deaths"
)

df_pivot

We can now apply the `plot` method on the pivoted DataFrame. Note that because the DataFrame contains multiple columns (one for each year), pandas will plot the year observations side-by-side for each continent (unless we specify which column to use).

In [None]:
df_pivot.plot(
    # y = 2022,    # uncomment to select a specific year
    kind = 'bar',
    figsize = (9, 4),
    rot = 0
)

plt.show()

To get a *stacked* bar plot, we can simply set `stacked=True` in the function call to `plot`.

In [None]:
df_pivot.plot(
    kind = 'bar',
    stacked = True,
    figsize = (9, 4),
    rot = 0,
    title = 'Total COVID-19 Deaths by Continent (stacked by year)',
)

plt.show()

Note that the values on the y-axis are hard to read. We can fix this by scaling the values in the DataFrame (e.g., dividing by 1,000). Alternatively, we can set the major tick formatter for the y-axis. This requires access to the `Axes` object, which the pandas `plot` method returns, so we store the output of the call to `plot` in a variable called `ax`.

In [None]:
ax = df_pivot.plot(
    kind = 'bar',
    stacked = True,
    figsize = (9, 4),
    rot = 0,
    title = 'Total COVID-19 Deaths by Continent (stacked by year)',
)

# Format values on yaxis
ax.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))

# Remove title in legend and add frame
ax.legend(title = None, frameon = True)

plt.savefig('plots/continent_stacked_plot.png', dpi = 500, bbox_inches = 'tight')