# Worksheet 1 - Scientific Visualization MVE080/MMG640
## Basics in Python, Jupyter, plotnine and how to visualize amounts

Name: _Your Name_

This is the first worksheet in the course *Scientific Visualization*. This Jupyter notebook has three functions:

1. It describes the tasks.
2. It (sometimes) provides coding templates that you can use as a basis for your own code.
3. It also serves as the template for the report that you upload to Canvas.

The tasks are of various types: some ask you to read a text and comment on it (no coding), and others ask you to create visualizations using plotnine.
Once you have finished all the tasks, export this document as an HTML file and upload it to Canvas.
You are encouraged to discuss problems and solutions with your fellow students (in the classroom and on CampusWire), but each student must solve all tasks by themselves and hand in their own report.
Notice that Jupyter notebooks use [Markdown](https://docs.github.com/en/github/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#links) for writing text cells. Make sure you understand the basics. Later on you can also include $\LaTeX$ in your Markdown cells.

Throughout the assignment you shall use a Python workflow.
If you are completely new to Python, take a look at [this page](https://pythonbasics.org).
Python can do essentially all that MATLAB can, plus more. 
In this course we shall use Python in different contexts, starting with the [Jupyter Notebook interface](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html). 

Matrices and arrays are handled through the NumPy module. [Learn here](https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html) how NumPy is different from MATLAB.

The cell below loads the packages required for this homework.

In [23]:
import numpy as np
import pandas as pd
from plotnine import *
# A plotnine dependency currently emits deprecation warnings, so
# we mute warnings to keep the output readable
import warnings
warnings.filterwarnings("ignore")

# A nice color palette for categorical data 
cbPalette = ["#E69F00", "#56B4E9", "#009E73", 
             "#F0E442", "#0072B2", "#D55E00", 
             "#CC79A7", "#999999"]

## Task 1

Read Chapters 1-3 of [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/), then answer the questions below.

### Question 1.1
Describe the visualization concept _scales_. 

### Answer 1.1
The concept of scales in data visualization refers to the foundational principle of mapping data values to aesthetics, such as size, color, and position.
Each scale must be one-to-one: every data value is paired with a unique aesthetic value, and each aesthetic value corresponds to one, and only one, data value.

### Question 1.2
In Figures 2.3 and 2.4 the same data is visualized in two different ways. Discuss the pros and cons of the two approaches. Which one do you prefer and why?

### Answer 1.2
Figure 2.3: Line graph

- Pros:
    - Clear trends: A line graph displays clear temperature trends over time, making it easy to see the rise and fall of temperature throughout the year for the different locations.
    - Easy comparisons: A line graph allows for easy comparison between locations at any specific point in time, as we can quickly compare the heights of the lines.
- Cons:
    - Overlapping: Lines can overlap, which can make it difficult to distinguish between data points.

Figure 2.4: Heat map

- Pros:
    - Visual clarity: The heat map uses color intensity to represent temperature, which makes it easy to immediately identify areas of high and low temperatures.
- Cons:
    - Hard to compare: It is hard to compare precise temperatures between locations, as we must look at the color shade and then refer back to the color scale to get an approximation of the exact temperature.

### Question 1.3
Describe situations when _nonlinear axes_ might be useful.
When should they not be used?

### Answer 1.3
_Your answer here_

### Question 1.4
In which situations could a _polar coordinate system_ be useful? 

### Answer 1.4
_Your answer here_


## Task 2 - Tidy data, ggplot and distributions

Several graphics libraries such as ggplot2 and plotnine are at their best when the provided data is tidy. However, data is often not provided in a tidy format, hence being able to transform non-tidy data into tidy data is a crucial skill. 

### Question 2.1 

In the lecture I provided a small non-tidy dataset (code below).

In [24]:
data1 = pd.DataFrame({"Site" : ["Stockholm", "Gothenburg", "London"], 
                      "1999" : [13, 85, 77], 
                      "2000" : [21, 31, 15]})
data1



         Site  1999  2000
0   Stockholm    13    21
1  Gothenburg    85    31
2      London    77    15


Transform this small dataset into a tidy dataset. Print the table below (as I did above). 

*Hint* [melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)

In [25]:
# Write your answer here

tidy_data = pd.melt(data1, id_vars="Site", var_name="Year", value_name="Measurement")
sorted_data = tidy_data.sort_values(by=["Site", "Year"])

print(sorted_data)


         Site  Year  Measurement
1  Gothenburg  1999           85
4  Gothenburg  2000           31
2      London  1999           77
5      London  2000           15
0   Stockholm  1999           13
3   Stockholm  2000           21
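
As a quick sanity check on the reshaping, a tidy table can be pivoted back to the wide layout and compared with the original. The sketch below rebuilds `data1` so that it runs on its own; the names `tidy`, `wide_again`, and `expected` are only illustrative and not part of the worksheet.

```python
import pandas as pd

# Same small wide table as in the question
data1 = pd.DataFrame({"Site": ["Stockholm", "Gothenburg", "London"],
                      "1999": [13, 85, 77],
                      "2000": [21, 31, 15]})

# Wide -> tidy: one row per (Site, Year) observation
tidy = pd.melt(data1, id_vars="Site", var_name="Year", value_name="Measurement")

# Tidy -> wide again: the Year values become columns
wide_again = (tidy.pivot(index="Site", columns="Year", values="Measurement")
                  .reset_index())
wide_again.columns.name = None  # drop the leftover "Year" label on the column index

# pivot sorts the index, so sort the original the same way before comparing
expected = data1.sort_values("Site").reset_index(drop=True)
print(wide_again.values.tolist() == expected.values.tolist())
```

If the comparison prints `False`, some value was lost or duplicated during the melt.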


### Question 2.2

Datasets are often bigger than above. I have made a subset of the weather data used in the lecture non-tidy (available on the webpage). 

Transform this dataset into a tidy dataset. Print the table below (as I did above).

In [31]:
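# Raw string (r'...') so that backslashes such as "\2" are not read as escape sequences.
# Note: on Swedish Windows, "Användare" is typically only the display name of the
# "Users" folder, so the path on disk may differ; adjust it to where the file actually is.
data2 = pd.read_csv(r'C:\Användare\David\MVE080\Data\2.2 Weather_not_tidy.csv')
print(data2)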


FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Användare\\David\\MVE080\\Data\x02.2 Weather_not_tidy.csv'
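
The `\x02` in the error message above is what Python produces when `\2` appears in an ordinary (non-raw) string literal: it is read as an octal escape for the character with code 2. This suggests that the traceback comes from a run without the `r` prefix, although that is an inference from the message rather than something the notebook states. A minimal sketch of the difference, using a shortened fragment of the path:

```python
# The same path fragment, written without and with the r prefix
plain = 'Data\2.2 Weather_not_tidy.csv'   # "\2" becomes the control character chr(2)
raw = r'Data\2.2 Weather_not_tidy.csv'    # the backslash and the "2" are kept literally

print(repr(plain))
print(repr(raw))
print("\x02" in plain, "\\" in raw)
```

Using forward slashes in the path is another common way to avoid the problem on Windows.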

## Distributions

For this part, reading Chapters 7 and 9 in [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/) helps.

### Question 3.1

The benefit of tidy data is that it is easy to work with. For this task, use the tidy weather dataset from above, choose three months, and visualize the differences in temperature between 2009 and 2010 for those months in a readable manner using **a histogram and a density plot**. Make sure to print the visuals below.

In [None]:
# Insert code here for histogram

# Remember to print the visual 

In [None]:
# Insert code here for density plot

# Remember to print the visual 

Provide a brief motivation for which plot (density or histogram) you find most readable.



_Your answer here_

### Question 3.2

In Lecture 2 I used boxplots, error bars, and violin plots to visualize fluctuations in winter temperature in Västerås. Now, using the weather data above, plot the temperature per month (like in Fig. 9.8 [here](https://clauswilke.com/dataviz/boxplots-violins.html)) using error bars, boxplots, and violin plots with data points. For each month, plot the temperatures for 2009 and 2010 next to each other (see the example on the webpage).
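
Error bars need a centre and a spread per group. One possible way to compute the mean and the standard error of the mean with pandas is sketched below; the sample values and the column names `month`, `year`, and `temp` are made up for illustration.

```python
import pandas as pd

# Tiny made-up sample; column names are assumptions
temps = pd.DataFrame({
    "month": ["Jan"] * 6,
    "year": ["2009"] * 3 + ["2010"] * 3,
    "temp": [-2.0, 0.0, 2.0, -5.0, -4.0, -3.0],
})

# Mean and standard error of the mean for every (month, year) group
summary = (temps.groupby(["month", "year"])["temp"]
                .agg(mean="mean", sem="sem")
                .reset_index())

# Lower and upper ends of the error bars
summary["ymin"] = summary["mean"] - summary["sem"]
summary["ymax"] = summary["mean"] + summary["sem"]
print(summary)
```

The `ymin` and `ymax` columns could then be mapped to the aesthetics of the same name in `geom_errorbar`.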

In [None]:
# Insert code here using standard error

# Remember to print the visual 

In [None]:
# Insert code here using boxplots

# Remember to print the visual 

In [None]:
# Insert code here using violin plot with dots

# Remember to print the visual 

Briefly discuss the drawbacks of each approach (standard errors, boxplots, and violin plots).

__Write answer here__

One of the drawbacks with a violin plot is that it can be hard to read summary statistics from it. Produce a violin plot visual here, where you have also drawn the median for each violin, as well as the data points. 

In [None]:
# Insert code here using violin plot with dots and median line

# Remember to print the visual 

Briefly discuss how you could make it easy for the reader to understand that the line in the visual represents the median.

__Write answer here__

### Question 3.3

Some people argue that visuals should be as minimal as possible. On the webpage you have a bad example of a minimalist way to plot the min, median, and max temperature of each month in 2009 and 2010 using the dataset in Question 3.2 (where you hopefully have filtered out daily temperatures for 2009 and 2010). By **only** using geom\_point recreate the visual (you do not need to recreate the title). **You are not allowed to transform the dataset**.

*Hint - Remember the stat argument in ggplot*

In [None]:
# Insert code here 
# Remember to print the visual 

### Question 3.4

Using the full weather dataset on the webpage, create two plots of your own choice. For example, you can compare winter temperatures across years, differences in night and day temperatures, etc. For each visual, provide a brief motivation for why your choice of visual (e.g. violin or boxplot) is a good choice.

In [None]:
# Insert code here 
# Remember to print the visual 

*Brief motivation*

In [None]:
# Insert code here 
# Remember to print the visual 

*Brief motivation*

## Amounts 

For this part, reading Chapter 6 in [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/) can help.

### Question 4.1 

Barplots are the workhorse for plotting amounts. Create a dataframe with the values from Tab. 6.1 in the [course book](https://clauswilke.com/dataviz/visualizing-amounts.html), and recreate Fig. 6.1 and Fig. 6.3 in the course book.

In [None]:
# Insert code here for Fig. 6.1
# Remember to print the visual 

In [None]:
# Insert code here for Fig. 6.3
# Remember to print the visual 

### Question 4.2 

Often we want to highlight a specific column in a barplot. Using the same layout as in the most recent plot highlight the column of Jumanji in orange, and keep the remaining columns grey.

In [None]:
# Insert code here 
# Remember to print the visual 

Besides using colors to highlight, adding numbers for the highlighted category can also help. For the Jumanji column also add the number that the bar corresponds to.

In [None]:
# Insert code here 
# Remember to print the visual 

### Question 4.3

On the webpage I have uploaded a dataset on the number of marriages in Stockholm, Gothenburg, Malmo and the rest of Sweden for 2020, 2015, 2010 and 2005. Using this dataset, produce i) a visual where it is easy to see in which year Stockholm had the second most marriages, and ii) a visual where it is easy to see how many more marriages there were in Stockholm compared to Gothenburg in 2015. **In each visual I want you to include the number of marriages for each city and year**.

In [None]:
# Insert code here for part i)
# Remember to print the visual 

In [None]:
# Insert code here for part ii)
# Remember to print the visual 

### Question 4.4

When plotting the mobile operating system data in the lectures I used a classical barplot. Another way, which makes it easy to track the trend over several years or across companies, is a common line plot. Using the dataset on the webpage, recreate the visual presented on the webpage.

In [None]:
# Insert code here 
# Remember to print the visual 

By default plotnine places the legend for a visual to the right or at the bottom, depending on the theme. Do you think having a legend next to the visual is the best solution?

__Write answer here__

### Question 4.5

In the lecture I provided data on European nations' median lifespan. Now, using the full dataset provided on the webpage, select a subset of countries and visualize i) life expectancy across a timespan of your choice and ii) life expectancy in 2020. Provide a brief motivation for your choice of visual and explain why it is readable.



In [None]:
# Insert code here for part i)
# Remember to print the visual 

*Brief motivation*

In [None]:
# Insert code here for part ii)
# Remember to print the visual 

*Brief motivation*