# Introduction to Seaborn

`Seaborn` is a library for creating data visualizations. More specifically, it is commonly used to create statistical graphs. It is built on top of `matplotlib` but offers a very distinct interface that is closely integrated with `pandas` data structures.

By storing your data in a `pandas` `DataFrame` and wrangling it into the right shape, `seaborn` lets you create complex visualizations with only a single line of code.

This notebook follows and expands upon the main points of the [official `seaborn` tutorial](https://seaborn.pydata.org/tutorial.html):

First, we introduce `pandas` and the data structure that `seaborn`'s plotting functions expect. Then, we give an overview of the different categories of plotting functions that `seaborn` offers. Finally, we will produce a variety of example visualizations and discuss how to customize their style.


## Data wrangling

`Seaborn` works best with data that is stored in tabular form in a `pandas` `DataFrame`. Since we want to focus on `seaborn` and data visualization in this session, we do not have time to discuss `DataFrame`s or the data wrangling capabilities of `pandas` in depth. If you are interested in learning more about that, however, check out our [workshop on data wrangling using `pandas`](https://git.dartmouth.edu/lib-digital-strategies/RDS/workshops/data-science/dirty-data).

For the purposes of this session, we will zoom in on `pandas`' main functions for changing the general structure of the data: `pivot`, `melt`, and `wide_to_long`.

To demonstrate these functions, we load two different datasets in two different shapes:


In [None]:
import pandas as pd

penguins_long = pd.read_csv("../data/penguins_long.csv")
flights_wide = pd.read_csv("../data/flights_wide.csv")

### Long-form versus wide-form data

Let's take a look at the two different datasets we loaded:


In [None]:
penguins_long

In [None]:
flights_wide

We notice that they have a very different structure:

In `penguins_long`, we have one variable per column and one observation per row. In `flights_wide`, we have one row per level of one of the variables (`year`), and one column per level of the other variable (the month). Each cell contains a value that is related to the specific levels of the variables shown across the columns and rows (i.e., the value in a particular year and a particular month).

<figure>
<img src="https://seaborn.pydata.org/_images/data_structure_19_0.png" style="width:60%">
<figcaption align="left">Long-form versus wide-form data. Each color denotes a variable.</figcaption>
</figure>

We often see the wide format in spreadsheets with few variables: It is intuitive to a human reader to find the intersection of the levels of interest of the two variables and then read the value. However, if we want to process the dataset computationally, this format has two major disadvantages:

- The meaning of the data is unclear because there is no label that clearly states what the numbers represent
- The format does not scale well to more than three variables: Each new variable would require repeating all levels of the variable expressed in the columns

The long format may seem at first glance much harder to read, but it does not suffer from the above issues: Every variable is clearly labeled by its column header, and adding a variable simply means adding another column.

For computational data processing, we therefore generally prefer the long format, also called [tidy data](https://vita.had.co.nz/papers/tidy-data.pdf).
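
To make the difference concrete, here is a minimal sketch that builds the same six values once in wide form and once in long form. The tables `wide` and `long` and their values are small illustrative stand-ins, not the datasets loaded above.

```python
import pandas as pd

# Illustrative toy data: passenger counts for two years and three months.
# Wide form: one row per year, one column per month.
wide = pd.DataFrame(
    {
        "year": [1949, 1950],
        "Jan": [112, 115],
        "Feb": [118, 126],
        "Mar": [132, 141],
    }
)

# Long form: one row per (year, month) observation, one column per variable.
long = pd.DataFrame(
    {
        "year": [1949, 1949, 1949, 1950, 1950, 1950],
        "month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
        "n_passengers": [112, 118, 132, 115, 126, 141],
    }
)

print(wide.shape)  # (2, 4)
print(long.shape)  # (6, 3)
```

Note that only the long table names the measured quantity (`n_passengers`); in the wide table, that label has nowhere to live.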


### How to get your data tidy

Because the wide format is often considered more human-readable, we frequently find wide data tables _in the wild_. Fortunately, `pandas` comes with a set of functions to restructure a `DataFrame`: `melt`, `pivot`, and `wide_to_long`.

The `melt` method lets you move a variable, whose levels are expressed as different columns, to a single column that contains one row per level.

<div class="alert alert-block alert-info">

The name _melt_ probably derives from the metaphor of thinking of the data table as a chunk of ice that you expose to heat on the right: The table will start to "melt" from the right, losing columns that trickle down and pool at the bottom, making the table narrower and longer.

</div>

Let's try it with the flights data:


In [None]:
flights_long = flights_wide.melt(
    id_vars="year",  # Which column should be used as the identifier variable (not melted)
    # All other columns will be melted!
    var_name="month",  # Name for the new variable/column
    value_name="n_passengers",  # Name for the column holding the values
)
flights_long

The inverse of the `melt` operation is `pivot`:


In [None]:
flights_long.pivot(
    index="year",  # Which column to use as the index (each level of this variable will be one row)
    columns="month",  # Create a new column for each level of this variable
    values="n_passengers",  # Which column provides the cell values (avoids hierarchical column labels)
)
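
As a rough check that `pivot` really undoes `melt`, the following sketch runs a round trip on a small stand-in table. The table `wide` is hypothetical, and the extra clean-up steps (`reset_index`, `rename_axis`, column reordering) are one possible way to get back to the original layout, not the only one.

```python
import pandas as pd

# Hypothetical stand-in for a wide table like flights_wide.
wide = pd.DataFrame({"year": [1949, 1950], "Jan": [112, 115], "Feb": [118, 126]})

# Wide -> long
long = wide.melt(id_vars="year", var_name="month", value_name="n_passengers")

# Long -> wide again
pivoted = long.pivot(index="year", columns="month", values="n_passengers")

# pivot moves "year" into the index, labels the column axis "month",
# and sorts the new columns alphabetically, so we undo those three things.
restored = pivoted.reset_index().rename_axis(columns=None)
restored = restored[list(wide.columns)]

print(restored.equals(wide))
```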

Finally, the function `wide_to_long` can help you restructure more complex wide-form data with more than three variables. This is beyond the scope of this session, however.
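
For the curious, here is a minimal sketch of what such a call can look like. The table and its column names (`penguin_id`, `bill_2007`, and so on) are made up for illustration; the sketch assumes that the wide columns follow a `<variable>_<year>` naming pattern.

```python
import pandas as pd

# Hypothetical wide table: two measurements (bill, flipper) taken in two years.
wide = pd.DataFrame(
    {
        "penguin_id": [1, 2],
        "bill_2007": [39.1, 46.5],
        "bill_2008": [39.5, 47.0],
        "flipper_2007": [181, 210],
        "flipper_2008": [186, 215],
    }
)

long = pd.wide_to_long(
    wide,
    stubnames=["bill", "flipper"],  # Column-name prefixes that become variables
    i="penguin_id",  # Identifier column
    j="year",  # Name for the variable parsed from the column suffixes
    sep="_",  # Separator between prefix and suffix
).reset_index()

print(long)
```

The result has one row per penguin and year, with `bill` and `flipper` as separate columns.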

<div class="alert alert-block alert-info">

You actually _can_ use wide-form data to create visualizations in `seaborn`. However, it is strongly encouraged to use long-form data to make the most out of `seaborn`'s ability to structure and label visualizations according to the variables in your data structure.

</div>

## The two categories of `seaborn` plotting functions

There is a great variety of plotting functions available within `seaborn`, but all of them fall into one of two categories: _axes-level_ functions or _figure-level_ functions.

An axes-level function is very similar to a `matplotlib` plotting function: It creates and returns a self-contained, single-axes plot.

A figure-level function, on the other hand, lets you create multiple axes with just a single line of code. We will see some examples of this in a little bit.

Figure-level functions also offer a single interface to multiple different axes-level functions:

<figure>
<img src="https://seaborn.pydata.org/_images/function_overview_8_0.png" style="width:60%">
<figcaption align="left">Organization of figure- and axes-level functions.</figcaption>
</figure>

In the next couple of sections, we will see how to use both kinds of functions.


## Visualizing statistical relationships

Two kinds of charts can be used to show the relationship between two or more variables: line plots and scatter plots.

Line plots work best when you have clear, continuous independent and dependent variables. The flights data, for example, is a time series and therefore fits this description.

You can create a line plot with `seaborn` by using the function `lineplot`, passing it the data, and specifying the names of the variables to be shown on the _x_ and _y_ axes, respectively:


In [None]:
import seaborn as sns

sns.lineplot(flights_long.query("year == 1950"), x="month", y="n_passengers")

In the above example, we only showed the series for a single year. If we keep the data from all years, there are multiple values of `n_passengers` for each value of `month`. When there are several _y_ values for the same value of _x_, `seaborn` by default shows the mean of these values along with a 95% confidence interval:


In [None]:
sns.lineplot(flights_long, x="month", y="n_passengers")

If instead we would like to see a different line per year, we can use the `style` parameter. If we specify a variable as the argument, the data will be grouped by that variable and each group will be plotted with a different line style:


In [None]:
sns.lineplot(flights_long, x="month", y="n_passengers", style="year")

If the grouping variable has more than a few levels, this can be hard to process for the human eye, however. We could therefore also change the color of each line using the `hue` parameter:


In [None]:
sns.lineplot(flights_long, x="month", y="n_passengers", style="year", hue="year")

The penguins dataset is a good example of a collection of observations that do not have a sequential relationship. We could therefore visualize some of the variables in a scatter plot:


In [None]:
sns.scatterplot(penguins_long, x="bill_length_mm", y="flipper_length_mm")

Once again, we can use a categorical variable to plot different subgroups of the data differently. For example, we could plot each species in a different color and with a different marker:


In [None]:
sns.scatterplot(
    penguins_long,
    x="bill_length_mm",
    y="flipper_length_mm",
    hue="species",
    style="species",
)

Since the interface of these functions is so similar, the figure-level function `relplot` can be used to create either of them:


In [None]:
sns.relplot(
    penguins_long,
    kind="scatter",
    x="bill_length_mm",
    y="flipper_length_mm",
    hue="species",
    style="species",
)

The main advantage of `relplot` is that, as a figure-level function, we can now create multiple axes based on yet another grouping variable (so-called _facets_):


In [None]:
sns.relplot(
    penguins_long,
    kind="scatter",
    x="bill_length_mm",
    y="flipper_length_mm",
    hue="species",
    col="island",
    row="sex",
)

## Visualizing distributions of data

If we are interested in visualizing the univariate distribution of a variable, we have a few options to choose from: a histogram, a probability density curve, a cumulative distribution function, or a marginal distribution. Just like the relational plots, all of these are also accessible through a figure-level function (`displot`).

Let's walk through them one by one, using the penguins as an example!

A histogram organizes the data by grouping it into ranges (a.k.a. _binning_) and then computes a statistic for each bin, e.g., the count.

For example, we can quickly familiarize ourselves with the body mass of our penguins using the default settings of the function `histplot()`:


In [None]:
sns.histplot(penguins_long, x="body_mass_g")

By default, the calculated statistic in each bin is the number of observations that fall into it. We can change this by supplying an argument to the `stat` parameter. To get the percentage of the total data that falls into each range, we can for example do this:


In [None]:
sns.histplot(penguins_long, x="body_mass_g", stat="percent")

We can also change the width of each bin using the corresponding parameter:


In [None]:
sns.histplot(penguins_long, x="body_mass_g", stat="percent", binwidth=250)
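
To illustrate what binning and the two statistics mean, here is a minimal sketch that computes them by hand with `numpy`. The values and the bin edges are made up for illustration; `seaborn` derives its bin edges from the range of the data, so the bars it draws for real data will generally sit at different positions.

```python
import numpy as np

# Hypothetical body masses in grams.
values = np.array([3200, 3450, 3700, 3750, 3800, 4050, 4300, 4700, 5000, 5400])

# Bin edges at 3000, 3500, ..., 5500 (each bin is 500 g wide).
binwidth = 500
edges = np.arange(3000, 5500 + binwidth, binwidth)

# stat="count": number of observations per bin
counts, _ = np.histogram(values, bins=edges)

# stat="percent": share of all observations per bin, in percent
percent = 100 * counts / counts.sum()

print(counts)   # [2 3 2 1 2]
print(percent)  # [20. 30. 20. 10. 20.]
```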

Exercise:

- Plot the histogram in different colors for each `sex`! **Hint:** `histplot()` uses many of the same parameters as `scatterplot()`!
- Add a `y` variable of your choice. What do you see for a categorical `y`, and what for a continuous `y`?


If the sample size is sufficiently large, we can also estimate the variable's probability density function using `kdeplot()`. KDE stands for _kernel density estimation_, and the result is essentially a smoothed histogram:


In [None]:
sns.kdeplot(penguins_long, x="body_mass_g")
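
The idea behind a kernel density estimate can be sketched in a few lines of `numpy`: place one small Gaussian "bump" on every observation and average them. The values and the hand-picked `bandwidth` below are hypothetical; `seaborn` chooses the bandwidth automatically, so this is an illustration of the principle rather than a reproduction of what `kdeplot()` draws.

```python
import numpy as np

# Hypothetical body masses in grams.
values = np.array(
    [3200, 3450, 3700, 3750, 3800, 4050, 4300, 4700, 5000, 5400], dtype=float
)
bandwidth = 300.0  # Width of each Gaussian bump (hand-picked assumption)

# Points at which we evaluate the density, with some margin on both sides.
grid = np.linspace(values.min() - 4 * bandwidth, values.max() + 4 * bandwidth, 500)

# One Gaussian bump per observation, evaluated at every grid point.
z = (grid[:, None] - values[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of all bumps.
density = bumps.mean(axis=1)

# A density should integrate to (approximately) one.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 2))
```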

Exercise:

- Once again, use different colors for each level of the variable `sex`! What do you observe compared to using `histplot()`?
- Change the degree of smoothing by changing the parameter `bw_adjust` to different values in the range from `0.1` to `1.0`. What do you observe for very large and very low values of `bw_adjust`?


Another way to look at your data's distribution is by calculating the empirical cumulative distribution. This visualization is great if you want to be able to show the percentiles in your data:


In [None]:
sns.ecdfplot(penguins_long, x="body_mass_g")
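
The empirical cumulative distribution is simple enough to compute by hand, which also shows how percentiles can be read off it. The sketch below uses made-up values; the variable names are illustrative.

```python
import numpy as np

# Hypothetical body masses in grams.
values = np.array([3800, 3200, 5400, 3750, 4300, 3450, 5000, 3700, 4700, 4050])

# Sort the observations; the i-th smallest value is reached
# after a proportion of i / n of the data.
x = np.sort(values)
proportion = np.arange(1, len(x) + 1) / len(x)

# Reading off a percentile: the smallest value at which
# at least 50% of the observations have been reached.
p50 = x[np.searchsorted(proportion, 0.5)]
print(p50)  # 3800
```

Plotting `proportion` against `x` as a step function gives the kind of curve that `ecdfplot()` draws.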

Sometimes we want to see the actual data points in addition to the computed statistics. For this case, we can add a marginal distribution to another plot using `rugplot()`:


In [None]:
sns.kdeplot(penguins_long, x="body_mass_g")
sns.rugplot(penguins_long, x="body_mass_g")

We can even do this in combination with plots that are of a different "family", e.g. `scatterplot`:


In [None]:
sns.scatterplot(
    penguins_long,
    x="bill_length_mm",
    y="flipper_length_mm",
)
sns.rugplot(
    penguins_long,
    x="bill_length_mm",
    y="flipper_length_mm",
)

## Visualizing categorical data


## Creating multi-plot visualizations


## Styling figures
