<center><a target="_blank" href="https://academy.constructor.org/"><img src=https://lh3.googleusercontent.com/d/1EmH3Jks5CpJy0zK3JbkvJZkeqWtVcxhB width="800" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

# <h1 align="center"> Exercises: Design & Storytelling </h1>

<p style="margin-bottom:1cm;"></p>

_____

<center>Constructor Academy, 2025</center>



The exercises for Day 2 practice techniques for telling stories with data.

## Table of contents

*  Exercise 1: Layering
*  Exercise 2: More Layering
*  Exercise 3: Faceting
*  Exercise 4: Anscombe's Quartet
*  Exercise 5: Even More Layering
*  Exercise 6: Presentation


## Preamble

In [None]:
# import packages
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# use seaborn style
sns.set()
sns.set_style('darkgrid')

## Read in data

**SNB Money-market Rate Data**

Data on money-market rates over various terms: overnight (`SARON`), the call money rate (`1TGT`), 1-, 3-, 6-, and 12-month CHF-denominated loans (`1M`, `3M0`, `6M`, and `12M`, respectively), and USD-, JPY-, GBP-, and EUR-denominated loans (`3M1`, `3M2`, `3M3`, and `3M4`, respectively).

Used in Ex 1

In [None]:
snb_df = pd.read_csv("../../data/snb-data-zimoma-en-all-20200901_1437.csv", sep=";", skiprows=2)
snb_df = snb_df.rename({'D0': 'Instrument'}, axis=1)
snb_df['Date'] = pd.to_datetime(snb_df['Date'])
snb_df = snb_df.set_index(['Date', 'Instrument']).unstack()['Value']
term_order = ['SARON', '1TGT', '1M', 'EG3M', '3M0', '3M1', '3M2', '3M3', '3M4', '6M', '12M']
snb_df = snb_df[term_order]
snb_df.head()

**MPG Data**

Data about a selection of automobiles in two years: 1999 and 2008. Includes the manufacturer and model of each car, as well as the type of car (`class`), the size of the engine (`displ`, `cyl`), the type of transmission (`trans`), and the city and highway fuel efficiency (`cty`, `hwy`) in miles per gallon.

Used in Ex 2, 3

In [None]:
mpg_df = pd.read_csv("../../data/mpg.csv")
mpg_df.head()

**Anscombe's Quartet**

Synthetic data with some strange properties, made up of four data sets. Contains three columns: `dataset`, which identifies the data set each row belongs to, and the `x` and `y` values.

Used in Ex 4

In [None]:
anscombe_df = pd.read_csv('../../data/anscombe.csv')
anscombe_df.head()

**Mortality Data**

Data on the daily number of deaths in France from 1 Jan 2000 to 18 May 2020. The columns `month` and `day` give the month and day of each row; the columns `2000` to `2020` hold the number of deaths on that day in each year.

Used in Ex 5

In [None]:
mort_df = pd.read_csv('../../data/morts_2020-05-18.csv')
month_day_df = mort_df['mois_jour'].str.split("/", expand=True)
month_day_df.columns = ['month', 'day']
month_day_df = month_day_df.astype('int32')
mort_df = month_day_df.join(mort_df)
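# Get rid of Feb 29 (missing in non-leap years such as 2001), we can ignore it
# Reset the index so that it runs from 0 to 364 without a gap
mort_df = mort_df.dropna(axis=0, subset=['2001']).reset_index(drop=True)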
mort_df.head()

## Ex. 1: Layering

Layering can be used to provide context necessary for interpreting data. We are going to take the line chart from yesterday and layer context onto the chart.

### Ex. 1.1

The `snb_df` dataframe includes a `1TGT` column of rates for tomorrow-next loans. Plot a line chart of this column, and layer the following events and data on top of the chart.

* 1973-11: Oil Price Shock
* 1991-01 – 1993-12: Recession in Switzerland
* 2009-01 – 2009-12: Recession in Switzerland
* 2008-09: Lehman Brothers collapse
* 2011-08: Introduction of CHF/EUR floor
* 2015-01: Removal of CHF/EUR floor

(Matplotlib documentation for [annotations](http://matplotlib.org/users/annotations.html) and [axvspan](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.axvspan.html) will be helpful here.)
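
As a hint, here is a minimal sketch of the two layering tools on a synthetic series. The series, the shaded span, and the event are all made up for illustration and are not taken from the SNB data; only `axvspan` and `annotate` carry over to the exercise.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic monthly series standing in for a rate column (not the SNB data)
dates = pd.date_range("2000-01-01", periods=120, freq="MS")
rng = np.random.default_rng(0)
demo_ser = pd.Series(rng.normal(0, 0.1, len(dates)).cumsum() + 2, index=dates)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(demo_ser.index, demo_ser.values)

# Shade a period (a made-up span) with axvspan
span = ax.axvspan(pd.Timestamp("2003-01-01"), pd.Timestamp("2004-12-31"),
                  color="grey", alpha=0.3)

# Mark a single point in time (a made-up event) with an arrow and a label
event_date = pd.Timestamp("2007-06-01")
event_value = demo_ser.loc[event_date]
note = ax.annotate("Hypothetical event",
                   xy=(event_date, event_value),
                   xytext=(event_date, event_value + 0.5),
                   arrowprops=dict(arrowstyle="->"),
                   ha="center")
```

A shaded span suits events with a duration (such as a recession), while an arrow with a label suits events at a single date.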


In [None]:
# Here is the initial plot to get you started
# Add the events to this plot
fig, ax = plt.subplots(figsize=(12, 8))
ser = snb_df['1TGT']
ser.plot(ax=ax)
ax.set_title("Swiss Call Money Rate")
ax.set_xlabel("")
ax.set_ylabel("Yield (%)");

In [None]:
# code your solution here

### Ex 1.2

Save the plot made in 1.1 as a PDF file.
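
As a hint, a figure object can be written to disk with `savefig`; the file extension selects the output format. The sketch below uses a throwaway figure and a placeholder path in the system's temporary directory, both of which are assumptions for illustration.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("Demo figure")

# The output path is a placeholder; the ".pdf" extension selects the PDF format
out_path = os.path.join(tempfile.gettempdir(), "demo_figure.pdf")
# bbox_inches="tight" trims surplus whitespace around the figure
fig.savefig(out_path, bbox_inches="tight")
```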

In [None]:
# code your solution here

## Ex 2: More Layering

### Ex 2.1

For this exercise, we will use the `mpg_df` dataframe. We will start with a scatterplot with `displ` on the x-axis and `hwy` on the y-axis.

For the car or cars with the best and worst highway fuel mileage, plot the data points as follows:
- Use [Brewer Set1](https://colorbrewer2.org/#type=qualitative&scheme=Set1&n=7) green for the best (#4DAF4A) and red for the worst (#E41A1C)
- Make the dots larger (10 points, `s=100`)
- Label each point with the manufacturer(s) of the automobile (column `manufacturer` in the dataframe)
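
As a hint, the pattern of highlighting extremes can be sketched on toy data. The dataframe and its column names (`maker`, `size`, `score`) are hypothetical stand-ins, not columns of `mpg_df`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data standing in for mpg_df (values and column names are made up)
toy_df = pd.DataFrame({
    "maker": ["A", "B", "C", "D", "E"],
    "size": [1.6, 2.0, 3.5, 5.0, 6.2],
    "score": [40, 35, 26, 18, 12],
})

# Boolean masks pick up ties: every row equal to the extreme value is selected
best = toy_df[toy_df["score"] == toy_df["score"].max()]
worst = toy_df[toy_df["score"] == toy_df["score"].min()]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(toy_df["size"], toy_df["score"], color="grey", alpha=0.6)

for subset, color in [(best, "#4DAF4A"), (worst, "#E41A1C")]:
    # Draw the highlighted points on top of the base layer
    ax.scatter(subset["size"], subset["score"], s=100, color=color)
    # Join the labels so several rows sharing the extreme value get one label
    label = ", ".join(sorted(subset["maker"].unique()))
    ax.annotate(label,
                xy=(subset["size"].iloc[0], subset["score"].iloc[0]),
                xytext=(8, 8), textcoords="offset points")
```

Comparing against `max()` and `min()` rather than using `idxmax()` keeps all tied rows, which matters when several cars share the best or worst value.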


In [None]:
# code your solution here

Step 1, identify the cars with the best and worst fuel mileage, and their manufacturers

In [None]:
# try to solve this yourself. If you are stuck, uncomment the next line and execute this cell
# %load fragment-2.1.1.py

Step 2, plot the full plot together with the best/worst cars and labels for the manufacturers

In [None]:
# code your solution here

### Ex 2.2

Starting with the plot from 2.1, draw a linear model of the relationship between displacement and highway mileage in Brewer Set1 orange (#ff7f00) over the plot. Label the line with the $r^2$ of the model.

Step 1, build a linear model of `hwy` as a function of `displ`

In [None]:
import statsmodels.formula.api as smf

def fit_and_predict(df):
    # fit a model explaining hwy fuel mileage through displacement
    lm = smf.ols(formula="hwy ~ displ", data=df).fit()

    # find two points on the line represented by the model
    x_bounds = [df['displ'].min(), df['displ'].max()]
    preds_input = pd.DataFrame({'displ': x_bounds})
    predictions = lm.predict(preds_input)
    return lm, pd.DataFrame({'displ': x_bounds, 'hwy': predictions})


lm, pred = fit_and_predict(mpg_df)
rsquared = lm.rsquared

Step 2, draw the plot and add a line of the model
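
As a hint, a straight line needs only its two end points, and the label can be attached near one of them. The sketch below uses made-up data and `np.polyfit` as a stand-in for the statsmodels fit; for a simple linear regression, $r^2$ equals the squared Pearson correlation.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data with a roughly linear, decreasing trend
rng = np.random.default_rng(1)
x = np.linspace(1.5, 7.0, 50)
y = 36 - 3.5 * x + rng.normal(0, 2, x.size)

# Stand-in for the fitted model: least-squares line and its r^2
slope, intercept = np.polyfit(x, y, deg=1)
r2 = np.corrcoef(x, y)[0, 1] ** 2

# Two end points are enough to draw a straight line
x_line = np.array([x.min(), x.max()])
y_line = intercept + slope * x_line

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(x, y, color="grey", alpha=0.6)
ax.plot(x_line, y_line, color="#ff7f00")
# Place the label next to the right-hand end of the line
label = ax.annotate(f"$r^2$ = {r2:.2f}",
                    xy=(x_line[1], y_line[1]),
                    xytext=(-60, 10), textcoords="offset points",
                    color="#ff7f00")
```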

In [None]:
# code your solution here

## Ex. 3: Faceting

Faceting can be used to show more data, provide context, and make a visualization easier to understand.

### Ex 3.1

Using the `mpg_df` dataframe, make a scatterplot with `displ` on the x-axis and `hwy` on the y-axis, faceted by class of car (the `class` column). Give the figure the title (`suptitle`) `Engine Size vs. Highway Fuel Mileage`, and label the y-axis `MPG` and the x-axis `Displacement (Liters)`.
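
As a hint, one possible route to a faceted scatterplot is seaborn's `relplot`, sketched here on toy data. The dataframe, its columns, and the label strings are placeholders; the sketch also assumes a seaborn version recent enough (0.11.2 or later) to expose the `figure` attribute on the returned grid.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Toy data with a categorical column to facet on (values are made up)
rng = np.random.default_rng(2)
toy_df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "x": rng.uniform(1, 7, 60),
})
toy_df["y"] = 40 - 4 * toy_df["x"] + rng.normal(0, 2, len(toy_df))

# One subplot per value of "group", wrapped onto rows of two
g = sns.relplot(data=toy_df, x="x", y="y", col="group", col_wrap=2,
                kind="scatter", height=3)
g.set_axis_labels("x label", "y label")
# Figure-level title, nudged upward so it does not overlap the facet titles
g.figure.suptitle("Figure title", y=1.03)
```

Plain matplotlib with `plt.subplots` and a loop over groups would work as well; `relplot` simply handles the grid layout and shared axes.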

In [None]:
# code your solution here

### Ex 3.2

Make the same plot as in 3.1, but this time **sort the facets** by the mean displacement for each class.
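
As a hint, the facet order can be derived from a grouped mean. The toy dataframe below is made up; only the `groupby` / `sort_values` pattern carries over.

```python
import pandas as pd

# Toy data (made up): three groups with different mean x values
toy_df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "x": [5.0, 6.0, 1.0, 2.0, 3.0, 4.0],
})

# Mean of x per group, sorted ascending; the index gives the facet order
facet_order = toy_df.groupby("group")["x"].mean().sort_values().index.tolist()
```

A list like `facet_order` can then be passed as `col_order` to seaborn's `relplot` or `FacetGrid`.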

In [None]:
# code your solution here

### Ex 3.3

Make the same plot as in 3.2, but this time highlight the car or cars with the best and worst fuel mileage, as was done in 2.1, and draw in the linear model from 2.2. That is:

For the car or cars with the best and worst highway fuel mileage, plot the data points as follows:
- Use Brewer Set1 blue for the best (#377EB8) and red for the worst (#E41A1C)
- Label each point with the model(s) of the automobile (column `model` in the dataframe)

And draw a linear model of the relationship between displacement and highway mileage over the plot (do not add the $r^2$ as text).

In [None]:
# code your solution here

## Ex 4: Anscombe's Quartet

This exercise works with anscombe_df.

### Ex 4.1

There are four data sets in the frame (indicated by the `dataset` column). What are the mean x and the mean y for each data set?

In [None]:
# code your solution here

### Ex 4.2

Compute a linear regression for each data set. What are the slope and the intercept?

Make a faceted plot, one subplot for each data set, drawing the points at their specified x/y values. Layer the regression lines over the plot. Label the regression lines with slope, intercept, and $r^2$.
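
As a hint, a per-group fit can be collected into one small table before plotting. The sketch below uses made-up data that lies exactly on two lines and `np.polyfit` rather than statsmodels; the helper `fit_line` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy data: two groups that lie exactly on different lines (made up)
toy_df = pd.DataFrame({
    "group": ["p"] * 4 + ["q"] * 4,
    "x": [0.0, 1.0, 2.0, 3.0] * 2,
    "y": [1.0, 3.0, 5.0, 7.0, 4.0, 3.5, 3.0, 2.5],
})

def fit_line(sub):
    # Least-squares line; polyfit returns the highest-degree coefficient first
    slope, intercept = np.polyfit(sub["x"], sub["y"], deg=1)
    # For a simple linear regression, r^2 is the squared Pearson correlation
    r2 = np.corrcoef(sub["x"], sub["y"])[0, 1] ** 2
    return pd.Series({"slope": slope, "intercept": intercept, "r2": r2})

# One row of fit results per group
rows = {name: fit_line(sub) for name, sub in toy_df.groupby("group")}
fits = pd.DataFrame(rows).T
```

Each row of `fits` can then supply both the line to draw in a facet and the text for its label.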

In [None]:
# code your solution here

### Ex. 4.3

*How appropriate is the linear model for each data set?*

A linear model makes sense for data set I, but is questionable for the others. It might make sense for III, but the strange outlier would need to be accounted for. II looks like it would be better modelled with a quadratic function, and something very different is going on in IV.

## Ex 5: Even More Layering

Let us recreate this fantastic graphic using the `mort_df`. For more background, see the following (all in French):
* http://coulmont.com/blog/2020/04/24/2020-une-mortalite-specifique/
* https://freakonometrics.hypotheses.org/60845
* https://www.lemonde.fr/les-decodeurs/article/2020/04/27/coronavirus-un-pic-tres-net-de-mortalite-en-france-depuis-le-1er-mars-par-rapport-aux-vingt-dernieres-annees_6037912_4355770.html

![COVID-19 in FR](http://coulmont.com/vordpress/wp-content/uploads/2020/04/deces-2001-2020-blog-general.png)

* Plot the day of the year (numbered from 0 to 364 or 365) on the x-axis, and the number of deaths on that day for each year on the y-axis.

Step 1, find the boundaries for the months in the data

In [None]:
import datetime
days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
month_cutoff = np.array([0] + days_in_month).cumsum()[:-1]
# "%b" is the portable code for the abbreviated month name ("%h" is a platform-specific alias)
month_names = [datetime.date(2020, i+1, 1).strftime("%b") for i in range(len(days_in_month))]

Step 2, plot one year of data to see what it looks like

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(mort_df.index, mort_df['2000'], alpha=0.5)
ax.set_xlim([0, 365])
ax.set_xticks(month_cutoff)
ax.set_title("Number of Daily Deaths in France, 2000 - 2020")
ax.set_xticklabels(month_names);

Step 3, plot the remaining years of data, up to 2019

In [None]:
# code your solution here

Step 4, plot the 2020 data in red and label it as the `Year 2020`

In [None]:
# code your solution here

Step 5
* label the big spike that goes over 3000 as `Heatwave, 2003`
* compute the average number of deaths per day, and layer a line with the average
* label the average line `Mean deaths, 2000-2019`
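
As a hint, a reference line and its label can be layered with `axhline` and `annotate`. The wide-format toy data below (one column per year, one row per day) is made up and far smaller than `mort_df`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy wide-format data: one row per day, one column per year (made up)
rng = np.random.default_rng(3)
toy_df = pd.DataFrame(
    {str(year): rng.normal(100, 5, 30) for year in (2001, 2002, 2003)}
)

fig, ax = plt.subplots(figsize=(6, 4))
for year in toy_df.columns:
    # Context lines in a muted colour so that later layers stand out
    ax.plot(toy_df.index, toy_df[year], color="grey", alpha=0.5)

# Mean over every day and every year, drawn as a horizontal reference line
overall_mean = toy_df.to_numpy().mean()
ax.axhline(overall_mean, color="black", linestyle="--")
ax.annotate("Overall mean", xy=(toy_df.index[-1], overall_mean),
            xytext=(-5, 5), textcoords="offset points", ha="right")
```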

In [None]:
# code your solution here

## Ex 6: Presentation

Using data from an earlier part of the course, make a PowerPoint (Keynote, Google Slides, etc.) presentation. Using what you have learned, see if you can improve any visualizations you might want to show, export them to PDF or PNG and embed them in a slide deck.

Think about the choice of typeface and the layout of the slides.

If you have nothing, use the Call Money Rate example. These links can provide context:
- https://en.wikipedia.org/wiki/Money_market
- https://www.global-rates.com/en/interest-rates/central-banks/central-bank-switzerland/snb-interest-rate.aspx