# Informatics 2 - Foundations of Data Science
## S1 Week 03: Visualisation for Exploratory Data Analysis - using Tidy Data and Seaborn

**Learning outcomes:** 
In this lab you will learn how to produce visualisations for exploratory data analysis. By the end of the lab you should be able to:
- explore relationships between variables with scatter and line plots
- use aesthetic elements like colour to represent variables
- explore how distributions differ across variables with histograms and box plots

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
import seaborn as sns
# Package to display the hints and solutions
from common.show_solutions import show

## A. Introduction

### A.1 Exploratory Data Analysis

The focus of exploratory data analysis (EDA) is to use the power of visualisation to help you as the analyst learn as much as possible about the data quickly. In EDA, visualisations should be informative and insightful, and understandable by you and your close colleagues. However, they don't need to be very beautiful or polished - this can wait until the data communication phase, which we'll cover in the next lab.

### A.2 Tidy Data and the Grammar of Graphics

As covered in the lecture on Data, in a tidy dataset, each column corresponds to a variable (or attribute) and each row corresponds to an instance or observation possessing each of the variables. Tidy data makes data exploration easy when paired with the "[grammar of graphics](https://en.wikipedia.org/wiki/Wilkinson%27s_Grammar_of_Graphics)" approach.
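As a small illustration of what "tidy" means in practice, the sketch below reshapes a made-up "wide" table (one column per year) into tidy form with `DataFrame.melt`. The table, its column names and its values are invented purely for this example and are not one of the lab datasets.

```python
import pandas as pd

# Hypothetical "wide" table: one row per region, one column per year.
wide = pd.DataFrame({
    "region": ["North", "South"],
    "2020": [10.0, 20.0],
    "2021": [11.0, 19.0],
})

# Tidy form: one row per (region, year) observation, one column per variable.
tidy = wide.melt(id_vars="region", var_name="year", value_name="value")
tidy["year"] = tidy["year"].astype(int)  # years were column labels, i.e. strings

print(tidy)
```

In the tidy version, `year` is an ordinary column, so it can be mapped to an axis or to colour like any other variable.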

The grammar of graphics is a method of producing visualisations in which variables are mapped directly onto different aesthetic elements found in a visualisation. For example, we might want to map one variable to the x-axis, another to the y-axis and a third variable to the colour of whatever it is we are plotting. Don't worry if this abstract description is confusing - it should hopefully make more sense when you see some examples! 

There are many different visualisation libraries out there; the one we will use today is Seaborn. As always, this lab is not comprehensive and only covers a small portion of Seaborn's capability. It is recommended that you check out the [documentation](https://seaborn.pydata.org/) to find out more.

We won't utilise the full grammar of graphics approach in this lab but, if you are interested, [Seaborn objects](https://seaborn.pydata.org/tutorial/objects_interface.html) are a good place to start.

### A.3 Seaborn

Visualisations used in data science can often be classified as distributional or relational. Distributional visualisations show the distribution of a certain variable (or several variables), which can help understand important statistics such as the mean, median, mode, and variance. Relational visualisations illustrate how different variables relate to each other. Seaborn differentiates explicitly between these two groups of visualisations and also introduces a third category of categorical visualisations, which involve a categorical variable on one axis. Categorical visualisations can again be distributional or relational, but sometimes need a different approach.

These three types of visualisation are created with `sns.displot()`, `sns.relplot()`, and `sns.catplot()`.

All three of these plots accept similar arguments, starting with the data, then a column name (in quotation marks) for each of the aspects of the plot that we want to map to a variable. For example, to map the columns `variable1` and `variable2` in a dataframe `df` to the x and y axis, we might write something like:

```
sns.___plot(       # Here ___ could be dis, rel or cat
    data = df,
    x = "variable1",
    y = "variable2"
)
```
Fill in the blank with the appropriate prefix. If we have more variables we'd like to visualise in the same plot, we can try mapping them to `hue` (colour), `style` (marker style, e.g. circle, cross, triangle), and `size` (size of the points). All three functions accept `hue`, whereas `style` and `size` are specific to relational plots:
```
sns.relplot(
    data = df,
    x = "variable1",
    y = "variable2",
    hue = "variable3",
    style = "variable4",
    size = "variable5"
)
```
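To make the template concrete, here is a minimal runnable sketch on a tiny made-up dataframe. The dataframe `toy` and its columns (`hours`, `score`, `group`, `attempts`) are hypothetical and exist only for illustration.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data, invented for this example only.
toy = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 64, 70, 75],
    "group": ["a", "b", "a", "b", "a", "b"],
    "attempts": [1, 1, 2, 2, 3, 3],
})

# Each keyword maps one column of the dataframe to one aesthetic of the plot.
grid = sns.relplot(
    data=toy,
    x="hours",        # position along the x-axis
    y="score",        # position along the y-axis
    hue="group",      # colour of each marker
    style="group",    # marker shape (the same variable may be mapped twice)
    size="attempts",  # marker size
)
```

The call returns a Seaborn `FacetGrid` object; in a notebook the figure is displayed automatically when the cell runs.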

## B. Relational plots

When we want to see how two variables relate to each other, we use a relational plot. Seaborn offers two kinds of relational plot: scatter plots and line plots. We choose which kind of relational plot to use with the `kind` argument:
 - Scatter plot: `sns.relplot(..., kind = "scatter")` (scatter plots are the default, so `kind = "scatter"` can be omitted)
 - Line plot: `sns.relplot(..., kind = "line")`

**Exercise 1:** Load the CO$_2$ emissions data created in the last lab from the datasets folder and store the data in `co2_emissions`. Print the first few lines to get a feel for the data.

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=1)

In [None]:
# Your code 


Suppose we are interested in seeing how the annual CO<sub>2</sub> emissions change over time. This is a question that relates several variables, so we want to use a relational plot. We can map the year to the x-axis and the annual CO<sub>2</sub> emissions to the y-axis.

In [None]:
sns.relplot(
    data = co2_emissions,
    x = "year",
    y = "annual_co2_tonnes"
)

Scatter plots are the default type of relational plot in seaborn. One point (marker) is drawn for each row in the `co2_emissions` dataframe. It might be possible to start drawing conclusions about the trend in CO$_2$ emissions over time, but it is pretty clear that there is an additional variable at play here: the region. To include regions in the plot we can, for example, use a different colour (or hue) for each region by adding `hue = "region"` to our seaborn call.

In [None]:
sns.relplot(
    data = co2_emissions,
    x = "year",
    y = "annual_co2_tonnes",
    hue = "region"
)

This gives us a much clearer picture. Given that this is a time-series plot, a line plot might make interpretation easier. We can change the scatter plot to a line plot with the `kind = "line"` argument.

**Exercise 2:** Create a line plot with `year` on the x-axis, `annual_co2_tonnes` on the y-axis and with colour determined by `region`.

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=2)

In [None]:
# Your code


**Discussion:** What conclusions could you make from the above visualisation? What further questions might you want to ask?

Your answer:

Instead of using colour to distinguish regions, we could plot each region on its own set of axes (a facet), with all facets sharing the same axis scales. Note that `col` in the code below means "column", not "colour", and that `col_wrap` sets the maximum number of facets in a row before the grid wraps onto a new row.

In [None]:
plot = sns.relplot(
    data = co2_emissions,
    x = "year",
    y = "annual_co2_tonnes",
    kind = "line",
    col = "region",
    col_wrap = 5
)

**Discussion:** What are the advantages and disadvantages of the two previous plots for exploratory data analysis, using colours or facets to separate regions? When might each type of plot be useful?

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=2.1)

Your answer:

__Note:__ This faceted plot is an example of a "small multiples" design - repeating the same set of axes, aligned so that the eye can compare the data easily.

## C. Distribution plots

Distributional plots allow us to see which values a particular variable takes, and how frequently each occurs. Seaborn offers three kinds of distributional plots; the easiest to use and interpret is the histogram, which is also the default option. If you are interested in learning about the other kinds of distributional plots, refer to the [Seaborn documentation](https://seaborn.pydata.org/generated/seaborn.displot.html).

First we will load a well-known [UCI](https://www.kaggle.com/uciml/pima-indians-diabetes-database) dataset of female patients over 21 years old and of Pima Indian heritage, in order to explore risk factors for diabetes. For more information on how the data were collected, see the original paper: 

Smith JW, Everhart JE, Dickson WC, Knowler WC, Johannes RS. [Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus.](https://pmc.ncbi.nlm.nih.gov/articles/PMC2245318/) Proc Annu Symp Comput Appl Med Care. 1988 Nov 9:261–5. PMCID: PMC2245318.

In [None]:
diabetes = pd.read_csv("datasets/diabetes.csv")
diabetes.head()

Distributional plots follow very similar syntax to relational plots. We begin by looking at the distribution of age in the dataset.

In [None]:
sns.displot(
    data = diabetes,
    x = "Age"
)

**Discussion:** In many cases, one expects a normal distribution (sometimes referred to as a Gaussian distribution), when looking at a distribution of a variable from a whole population, e.g. height, IQ or body mass index. However, this is not the case above. Discuss with your lab partner (a) why the above distribution might still be a good representation of the true distribution and (b) what type of skew the distribution illustrates and why? Write down your answers:

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=2.2)

Your answer:

**Exercise 3:** It is well known that obesity (commonly defined as having a BMI of 30 or over) is a risk factor for diabetes. Investigate this by plotting a histogram with BMI mapped to the x-axis and the outcome mapped to the hue (colour).

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=3)

In [None]:
# Your code


**Discussion:** You should notice that there are some people with a recorded BMI of 0. How should we interpret this information?

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=3.1)

Your answer:

Another possible distributional plot here is a density plot. A density plot is essentially a smoothed histogram, which can sometimes result in a less-cluttered looking visualisation. It's important to note that the y-axis here is the _density_, analogous to the _frequency density_ sometimes seen in histograms. It indicates how relatively likely values near that point are, but it is not itself a probability. The total area under the curve is equal to 1.

In seaborn, we can produce a density plot with `kind = 'kde'`, which stands for "Kernel Density Estimate".

In [None]:
sns.displot(
    data = diabetes,
    x = 'BMI',
    hue = 'Outcome',
    kind = 'kde'
)

Notice that the x-axis extends to negative numbers in the density plot, and that these values appear to have positive density. This is an artefact of the smoothing: the kernel density estimate spreads each data point over a small neighbourhood, so the curve extends beyond the smallest and largest observations and tapers smoothly to zero rather than stopping at a sharp cut-off. It is, however, possible to truncate the curve at the limits of the data using the `cut = 0` argument. You must be careful when interpreting density plots and, if you do wish to use them, we recommend reading the [documentation](https://seaborn.pydata.org/generated/seaborn.kdeplot.html).
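To see where the positive density below the smallest observation comes from, the sketch below builds a kernel density estimate by hand with NumPy. It assumes a Gaussian kernel and a simple rule-of-thumb bandwidth; Seaborn's exact defaults may differ, and the values are made up (they are not taken from the diabetes dataset).

```python
import numpy as np

# Made-up, BMI-like values (including a suspicious 0), for illustration only.
values = np.array([0.0, 22.5, 24.0, 27.3, 30.1, 31.8, 33.6, 35.2, 41.0, 45.5])
n = values.size

# Assumed rule-of-thumb bandwidth: sample standard deviation times n^(-1/5).
bandwidth = values.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid extending 3 bandwidths past the data limits.
grid = np.linspace(values.min() - 3 * bandwidth,
                   values.max() + 3 * bandwidth, 2001)

# One Gaussian bump centred on each observation, averaged over observations.
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# Trapezoidal rule: the area under the curve should be very close to 1.
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))

# Density evaluated below the smallest observation is still positive.
below_min = density[grid < values.min()]

print(f"area under curve: {area:.4f}")
print(f"largest density below the minimum: {below_min.max():.5f}")
```

Each bump has tails on both sides, which is why the estimate leaks past the extremes of the data; truncating the grid at the data limits is what `cut = 0` does in Seaborn.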

## D. Categorical Plots

Sometimes our data is not continuous, but categorical. In these instances, we have to think more carefully about how we visualise our data and so seaborn has a third type of visualisation, `sns.catplot()`. There are many types of categorical plots in seaborn, which can be found in the [seaborn documentation](https://seaborn.pydata.org/generated/seaborn.catplot.html#seaborn.catplot).

### D.1 Box plots

Boxplots (also known as box and whisker plots) are one-dimensional representations of a distribution in which the box extends from the lower (first) to the upper (third) quartile values of the data, while the line across the box represents the median. The 'whiskers', the lines extending from the box, can represent different things, as described in the [Wikipedia article on boxplots](https://en.wikipedia.org/wiki/Box_plot). By default, Matplotlib defines the end of the upper whisker as the value of the largest data point that lies within 1.5 times the interquartile range from the upper quartile, and the lower whisker as the value of the smallest data point that lies within 1.5 times the interquartile range from the lower quartile. Data points that lie outwith the whiskers are called outliers, and are represented by dots or circles.
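The whisker rule described above can be checked by hand on a small example. The sketch below uses made-up values and linearly interpolated quartiles, which is the default of `Series.quantile`; the variable names are chosen only for this illustration.

```python
import pandas as pd

# Made-up values with one obviously extreme point.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1 = values.quantile(0.25)   # lower (first) quartile
q3 = values.quantile(0.75)   # upper (third) quartile
iqr = q3 - q1                # interquartile range

# Limits at 1.5 * IQR beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that lie inside the limits.
lower_whisker = values[values >= lower_limit].min()
upper_whisker = values[values <= upper_limit].max()

# Anything outside the limits is drawn as an individual outlier.
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, q3, iqr)                    # 3.25 7.75 4.5
print(lower_whisker, upper_whisker)   # 1 9
print(outliers.tolist())              # [30]
```

Note that the whiskers end at actual data points (1 and 9 here), not at the limits themselves (-3.5 and 14.5).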

**Exercise 4:** Compute the upper and lower quartile, the median, the mean, and the standard deviation of BMI, for patients with and without diabetes separately, using functions from the pandas library.

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=4)

In [None]:
# Your code


Now that we've computed these values by hand, let's see how Seaborn visualises them. Box plots are considered a categorical plot in seaborn, so can be created with `sns.catplot(..., kind = "box")`.

In [None]:
sns.catplot(
    data = diabetes,
    kind = "box",
    x = "Outcome",
    y = "BMI"
)

Again, we see that a higher BMI is associated with diabetes in the dataset as a whole. But is this a fair comparison? Suppose that BMI is associated with the number of pregnancies; when we restrict the analysis to compare only women who have had the same number of pregnancies, does the association between BMI and diabetes still hold? 

Let's investigate the association between diabetes and BMI, across different numbers of pregnancies.

**Exercise 5:** Produce a box plot with the number of pregnancies on the x axis, BMI on the y axis, and colour (hue) used to distinguish patients with and without diabetes (outcome).

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=5)

In [None]:
# Your code


**Discussion:** Does the association between BMI and diabetes hold for women with differing numbers of pregnancies?

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=5.1)

Your answer:

**Discussion:** Boxplots represent distributions in a very simplified manner. Discuss with your lab partner when and how a box plot could misrepresent information. What would be an example of a distribution that would be badly represented by a boxplot?

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=5.2)

Your answer:

## E. Deciding what to visualise

When there are many variables in a dataset, it can be overwhelming to think of interesting questions to ask and to explore the data through visualisations. Often, a good idea can be to produce one big array of plots that shows many relationships and distributions in one go. A pair plot produces a scatter plot comparing each pair of variables, as well as a density plot for each variable.

Once again, we can use colour to help distinguish between patients with and without diabetes.

In [None]:
sns.pairplot(diabetes, hue = "Outcome")

**Discussion:** How does what you see in the dataset relate to the plots above? What questions do you now have about the data?

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=5.3)

Your answer:

## F. An exercise in plotting multivariate data

Now it's your turn to fuse some data and generate some exploratory plots using the principles introduced above.

**Exercise 6:** 

 a) Load the `co2_emissions_vs_gdp.csv` dataset from the `datasets` folder. 
 
 b) Visualise the CO<sub>2</sub> emissions per capita, GDP per capita and total population all in one plot.
 

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=6)

In [None]:
# Your code


**Exercise 7:**

a) In the last lab, the `drinks_by_country.csv` dataset contained the continents. Load that dataset (we have made a copy in this lab's dataset folder).

b) Merge the new DataFrame with the previous DataFrame (`co2_vs_gdp`). Careful: In `co2_vs_gdp` the country column is called `Entity` whereas in `drinks_by_country` the country column is called `Country`.

c) Try to create another visualisation with CO<sub>2</sub> emissions per capita, GDP per capita, total population and continent all in one plot.
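The column-name mismatch mentioned in part (b) is a common situation when merging. As a generic sketch, the example below merges two tiny made-up dataframes whose key columns have different names, using the `left_on` and `right_on` arguments of `DataFrame.merge`. The dataframes `left` and `right`, the place names and all values are invented for illustration and are not the lab datasets.

```python
import pandas as pd

# Hypothetical dataframes with differently named key columns.
left = pd.DataFrame({
    "Entity": ["Aland", "Bordia", "Cestan"],
    "gdp_per_capita": [1.0, 2.0, 3.0],
})
right = pd.DataFrame({
    "Country": ["Aland", "Bordia", "Dorvia"],
    "continent": ["X", "Y", "Z"],
})

# Match rows where left["Entity"] equals right["Country"].
# An inner merge keeps only keys present in both dataframes.
merged = left.merge(right, left_on="Entity", right_on="Country", how="inner")

# Keys in `left` with no partner in `right` are silently dropped by an
# inner merge, so it is worth checking which ones they are.
unmatched = left.loc[~left["Entity"].isin(right["Country"]), "Entity"].tolist()

print(merged)
print("unmatched:", unmatched)   # unmatched: ['Cestan']
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is not needed.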

In [None]:
# Run this cell to be offered with hints and solution
show(week=3, question=7)

In [None]:
# Your code


We will use the merged data frame in the next lab, so we save it as a .csv file:

In [None]:
co2_vs_gdp_continent.to_csv("datasets/co2_vs_gdp_continent.csv", index_label=False)

**Discussion:** Look back through the visualisations created in this lab and reflect on [Visualisation Principle 1: show the data](./FDS-visualisation-principles.pdf). Can you identify which aspects of the guidance relate to each plot?