# Worksheet 4 - Scientific Visualisation MVE080/MMG640
## Part 1: Uncertainty and storytelling with data

Names of all group members:

- Daniel Esteban Lahti
- Ana Camila Jimenez Mendoza
- Edin Ahmemulic
- Filip Keerberg

This is the first worksheet in the course *Scientific Visualisation*. This Jupyter notebook has three functions:

1. It describes the tasks.
2. It (sometimes) provides coding templates that you can use as a basis for your own code.
3. It also serves as the template for the report that you upload in Canvas.

The tasks are of various types: some are to read some text and then comment on it (no coding), and some are about creating visualisations using plotnine. Once you have finished all the tasks, export this document as an HTML file and upload it in Canvas. 

The goal of these homeworks is to improve your skills in visualising your science. You solve the homeworks in groups; however, annotate all the code (even the theoretical parts) with who solved each question. **Even though the homework is submitted as a group, you will be *individually* evaluated**. Motivate your choice of graph, legend, colourmap etc. below your graph in a separate cell. 

Notice that Jupyter notebooks use [Markdown](https://docs.github.com/en/github/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#links) for writing text cells. Make sure you understand the basics. 

Throughout the assignment you shall use a Python workflow.
If you are completely new to Python, take a look at [this page](https://pythonbasics.org).
Python can do essentially all that MATLAB can, plus more. 
In this course we shall use Python in different contexts, starting with the [Jupyter Notebook interface](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html). 

Matrices and arrays are handled through the NumPy module. [Learn here](https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html) how NumPy is different from MATLAB.

The below loads the packages required for this homework.

In [13]:
import numpy as np
import pandas as pd
import geopandas as gpd
from plotnine import *
# A plotnine dependency currently emits deprecation warnings, so
# we mute warnings for a cleaner experience
import warnings
warnings.filterwarnings("ignore")

# A nice color palette for categorical data 
cbPalette = ["#E69F00", "#56B4E9", "#009E73", 
             "#F0E442", "#0072B2", "#D55E00", 
             "#CC79A7", "#999999"]

## Task 1

Read Chapter 16 of [Fundamentals of Data Visualizations](https://clauswilke.com/dataviz/), then answer the questions below.

### Question 1.1

During Lecture 6, we visualised uncertainty for point estimates using i) graded error bars and ii) fuzziness (slide 7). Briefly describe when each kind of visual is suitable.

### Answer 1.1
_Your answer here_

### Question 1.2

Another way to visualise uncertainty is a **hypothetical outcome plot** (Chapter 16 in [Fundamentals of Data Visualizations](https://clauswilke.com/dataviz/)). Briefly describe the key features of such a visual.

### Answer 1.2
_Your answer here_

### Question 1.3

Often when doing regression (fitting a curve to data), we want to use the regression model to make predictions. What is important to think about when visualising model predictions?

### Answer 1.3
_Your answer here_


## Uncertainty  

For this part reading chapter 16 in [Fundamentals of Data Visualizations](https://clauswilke.com/dataviz/) can help.

### Question 2.1

Coffee is undoubtedly a popular beverage. On Canvas, I have uploaded a dataset with ratings for coffee beans from six different countries. 

Here, we are interested in identifying which country on average has the best quality beans. Create two plots: in i) plot the mean ($\mu$) and standard deviation for each country (using error bars), and in ii) plot the mean and the standard error of the mean estimate. Briefly discuss the drawbacks of error bars. 

As a reminder, for country $j$ with coffee ratings $x^{(j)}_1, \ldots, x^{(j)}_{n_j}$ the sample mean is given by 

$$
\hat{\mu}^{(j)} = \frac{1}{n_j} \sum_{i = 1}^{n_j} x_i^{(j)},
$$

the sample standard deviation by

$$
\hat{\sigma}^{(j)} = \sqrt{\frac{1}{n_j - 1} \sum_{i = 1}^{n_j} (x_i^{(j)} - \hat{\mu}^{(j)})^2},
$$

and the sample standard error for $\hat{\mu}^{(j)}$ by

$$
\hat{\sigma}_{\hat{\mu}^{(j)}} = \frac{\hat{\sigma}^{(j)}}{\sqrt{n_j}},
$$

and a confidence interval with confidence level $1 - \alpha$ is given by

$$
\hat{\mu}^{(j)} \pm t_{n_j-1}(1 - \alpha / 2) \hat{\sigma}_{\hat{\mu}^{(j)}},
$$

where $t_{n_j-1}(p)$ is the $p$-quantile of the t-distribution with $n_j-1$ degrees of freedom. For confidence level $1 - \alpha=0.95$ and $n_j = 30$ we have that $t_{29}(1 - 0.025) \approx 2.05$.
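
The summary statistics above can be computed per country with pandas and SciPy. The following is a minimal sketch on simulated ratings, not on the Canvas dataset: the column names `country` and `rating` and the helper `summarise` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated ratings; the real dataset's column names may differ.
rng = np.random.default_rng(1)
ratings = pd.DataFrame({
    "country": np.repeat(["A", "B"], 30),
    "rating": rng.normal(loc=82, scale=3, size=60),
})

def summarise(data, group="country", value="rating", level=0.95):
    """Mean, SD, SE and a t-based confidence interval per group."""
    out = (data.groupby(group)[value]
               .agg(mean="mean", sd="std", n="count")  # "std" uses ddof=1
               .reset_index())
    out["se"] = out["sd"] / np.sqrt(out["n"])
    # Quantile of the t-distribution with (group size - 1) degrees of freedom
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=out["n"] - 1)
    out["ci_low"] = out["mean"] - t_crit * out["se"]
    out["ci_high"] = out["mean"] + t_crit * out["se"]
    return out

summary = summarise(ratings)
print(summary.round(2))
```

The resulting table has one row per country, which is a convenient shape for mapping `mean`, `ci_low` and `ci_high` to error-bar aesthetics.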

In [14]:
# Insert code here for visual 1
# Remember to print the visual 

In [15]:
# Insert code here for visual 2
# Remember to print the visual 

*Insert brief discussion*

### Question 2.2

Now using the coffee dataset, visualise the uncertainty in the mean estimate using i) graded error bars (here, plot confidence intervals with 80%, 90% and 99% confidence) and ii) fuzzy error bars (as on slide 7 in Lecture 6).
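
One possible way to prepare graded error bars is to compute one interval per country and confidence level, then draw each level as its own layer. The sketch below uses made-up summary values; the names `summary`, `intervals` and `plot_graded`, and the line widths, are illustrative assumptions rather than a prescribed solution.

```python
import pandas as pd
from scipy import stats

# Made-up per-country summaries (mean, standard error, sample size).
summary = pd.DataFrame({
    "country": ["A", "B", "C"],
    "mean": [82.1, 83.4, 81.0],
    "se": [0.5, 0.6, 0.4],
    "n": [30, 30, 30],
})

levels = [0.80, 0.90, 0.99]
# One row per (country, confidence level)
intervals = pd.concat([summary.assign(level=lv) for lv in levels],
                      ignore_index=True)
t_crit = stats.t.ppf(1 - (1 - intervals["level"]) / 2, df=intervals["n"] - 1)
intervals["ci_low"] = intervals["mean"] - t_crit * intervals["se"]
intervals["ci_high"] = intervals["mean"] + t_crit * intervals["se"]

def plot_graded(intervals):
    """Thin line for the widest interval, thicker lines for narrower ones."""
    # Imported here so the table above can be built without plotnine.
    from plotnine import ggplot, aes, geom_linerange, geom_point, labs
    p = ggplot(intervals, aes(x="country"))
    for level, width in [(0.99, 0.8), (0.90, 1.8), (0.80, 3.0)]:
        layer = intervals[intervals["level"] == level]
        p = p + geom_linerange(aes(ymin="ci_low", ymax="ci_high"),
                               data=layer, size=width)
    points = intervals.drop_duplicates("country")
    return (p + geom_point(aes(y="mean"), data=points, size=3)
              + labs(x="Country", y="Mean rating"))
```

In a notebook cell, evaluating `plot_graded(intervals)` should display the figure.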

In [16]:
# Insert code here for visual 1
# Remember to print the visual 

In [17]:
# Insert code here for visual 2
# Remember to print the visual 

### Question 2.3

Frequency graphs are a powerful tool for visualising probabilities. On Canvas, I have uploaded an image of such a graph. Recreate it (you do not have to recreate the colours perfectly). The figure might look strange when rendered in Jupyter; in that case, it is better to save it to disk and inspect the saved file.
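
As a rough illustration of the idea behind a frequency graph, the sketch below builds a grid of cells in which a fraction of randomly placed cells is marked as an event. It is not a recreation of the Canvas image: the grid size, the probability, the colours and the helper names `frequency_grid` and `plot_frequency_grid` are all assumptions.

```python
import numpy as np
import pandas as pd

def frequency_grid(prob, n_rows=10, n_cols=10, seed=0):
    """One row per grid cell, with round(prob * n_cells) cells marked as events."""
    n_cells = n_rows * n_cols
    n_events = int(round(prob * n_cells))
    rng = np.random.default_rng(seed)
    event = np.zeros(n_cells, dtype=bool)
    event[rng.choice(n_cells, size=n_events, replace=False)] = True
    rows, cols = np.divmod(np.arange(n_cells), n_cols)
    return pd.DataFrame({"row": rows, "col": cols, "event": event})

grid = frequency_grid(prob=0.1)

def plot_frequency_grid(grid):
    """Draw each cell as a square tile, coloured by whether it is an event."""
    # Imported here so the grid can be built without plotnine.
    from plotnine import (ggplot, aes, geom_tile, coord_fixed,
                          scale_fill_manual, theme_void)
    return (ggplot(grid, aes(x="col", y="row", fill="event"))
            + geom_tile(color="white", size=1)
            + coord_fixed()
            + scale_fill_manual(values=["#999999", "#D55E00"])
            + theme_void())
```

To inspect the figure outside Jupyter, a call such as `plot_frequency_grid(grid).save("frequency_grid.png", width=4, height=4, dpi=300)` writes it to disk; the file name and dimensions here are arbitrary.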

In [18]:
# Insert code here 
# Remember to print the visual 

### Question 2.4

There are different ways to visualise curve fits. One is the fan plot (see lecture 6). On Canvas, I have uploaded a fan plot and associated dataset; use this dataset to recreate the visual.

**Note** - Hypothetical outcomes are generated via bootstrapping; for computational speed it might be worthwhile to downsample the hypothetical outcomes.
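
To make the note concrete, here is a small sketch of bootstrapping a curve fit, downsampling the resulting hypothetical outcomes, and computing the pointwise bands a fan plot shades. It assumes simulated data, a straight-line model and case resampling (resampling rows with replacement); the Canvas dataset and the model used in the lecture may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data: a noisy straight line (not the Canvas dataset).
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

n_boot = 1000
grid = np.linspace(x.min(), x.max(), 100)
curves = np.empty((n_boot, grid.size))
for b in range(n_boot):
    idx = rng.integers(0, x.size, size=x.size)  # resample rows with replacement
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    curves[b] = intercept + slope * grid        # one hypothetical outcome

# Downsample: keep a random subset of curves when drawing individual outcomes
keep = rng.choice(n_boot, size=50, replace=False)
curves_small = curves[keep]

# Fan bands: pointwise lower and upper quantiles across all bootstrap curves
bands = {
    level: np.quantile(curves, [(1 - level) / 2, 1 - (1 - level) / 2], axis=0)
    for level in (0.5, 0.8, 0.95)
}
```

Each entry of `bands` holds a lower and an upper curve on `grid`, which could be mapped to ribbon layers, while `curves_small` keeps plotting of individual outcomes fast.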

In [19]:
# Insert code here 
# Remember to print the visual 

## Part 2: Storytelling with data

## Storytelling with data

For this part reading chapter 20 in [Fundamentals of Data Visualizations](https://clauswilke.com/dataviz/) can help. An important part of making people engage with your visual is that it looks good. Therefore, for this part I will also judge the aesthetics of the visual, as well as how efficiently it conveys its main message.

### Question 1.1

To make a visual accessible it is important to think like a designer, and a key feature here is the clever use of highlighting and text annotations. An example of a well-designed visual is the Goal-attainment visual shown in the lecture. To practice annotations, recreate it. The dataset is on Canvas. 
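
A generic pattern for highlighting is to grey out everything except the item you want the reader to notice, and to label only that item. The sketch below shows the pattern on made-up numbers; it does not reproduce the Goal-attainment visual, and the data, the `target` threshold and the helper `plot_goals` are assumptions.

```python
import pandas as pd

# Made-up data, not the Canvas dataset.
goals = pd.DataFrame({
    "team": ["North", "South", "East", "West"],
    "attainment": [104, 97, 71, 101],  # percent of goal reached
})
target = 90  # hypothetical threshold for "missed"
goals["status"] = pd.Categorical(
    ["missed" if v < target else "on track" for v in goals["attainment"]],
    categories=["on track", "missed"],
)
# Only the highlighted bar gets a text label; the others stay unlabelled.
goals["label"] = [
    f"{team}: {value}%" if status == "missed" else ""
    for team, value, status in zip(goals["team"], goals["attainment"], goals["status"])
]

def plot_goals(goals):
    """Grey bars for teams on track, one accent colour and label for the rest."""
    # Imported here so the table above can be built without plotnine.
    from plotnine import (ggplot, aes, geom_col, geom_text, scale_fill_manual,
                          labs, theme_minimal, theme)
    return (ggplot(goals, aes(x="team", y="attainment", fill="status"))
            + geom_col()
            + geom_text(aes(label="label"), nudge_y=4, color="#D55E00")
            + scale_fill_manual(values=["#999999", "#D55E00"])
            + labs(title="East is the only team below target",
                   x="", y="Goal attainment (%)")
            + theme_minimal()
            + theme(legend_position="none"))
```

The title states the message directly, and the legend is dropped because the single label already explains the accent colour.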

In [4]:
# Insert code here 

# Remember to print the visual 

**Briefly comment on why red is a good choice of colour in this visual.**

_Brief motivation_

### Question 1.2

During the lecture, I gave the Spanish unemployment data a long-needed makeover; specifically, I made it accessible to a wider audience. Choose any visual from the slides in Lectures 1-6, decide upon a point you want to make, and use the lessons from the storytelling lecture to make that point accessible. 

Explain how your changes to the visual have made it more accessible.

In [5]:
# Insert code here 
# Remember to print the visual 

_Brief motivation_

### Question 1.3

On Canvas I have uploaded a made-up pie chart example from a summer pilot program, where students were asked before and after the program how excited they were about doing science. The program was a success, but a pie chart is not a good way to visualize the data. Using lessons from the lecture, improve the visual such that a wide audience can identify that the program was a success. 

In the new visual make use of proper highlighting, annotations, a good title, and choose a good form of visualisation. The new visual should include all the data used in the pie chart.

In [6]:
# Insert code here 
# Remember to print the visual 

### Question 1.4

Clutter is something we want to remove. On Canvas I have uploaded a visual on performance index for *Our Business* compared to other companies. Improve this visual such that it becomes easy to see how *Our Business* compares against the other companies for each category.

Provide a brief motivation for why your new visual is accessible.

In [7]:
# Insert code here 
# Remember to print the visual 

## Tables

For this part reading chapter 22 in [Fundamentals of Data Visualizations](https://clauswilke.com/dataviz/) can help. 


### Question 2

A well-designed table can be a good tool for reporting the exact values in your data. Following the lecture slides and the course book, redesign the table below to make it readable and easy to understand.

![Table_homework5.png](attachment:f6d3c1cf-e5bd-41c5-b06c-220516632164.png)

Taking a look at the table source (https://medium.com/@dexter.shawn/how-uc-berkeley-almost-got-sued-because-of-lying-data-aaa5d641f571) may help you in your decisions upon completing this task. 

You may use any software (except for ChatGPT and other AI tools). Describe what you changed and why.

In [8]:
# Insert table here

_Insert answer here_