<div class='bar_title'></div>

*Introduction to Data Science (IDS)*

# Data Visualization with Lets_Plot

Gunther Gust <br>
Chair for Enterprise AI<br>
Data Driven Decisions (D3) Group<br>
Center for Artificial Intelligence and Data Science (CAIDAS)

<img src="images/d3.png" style="width:20%; float:left;" />

<img src="images/CAIDASlogo.png" style="width:20%; float:left;" />

## Sources
This lecture relies mainly on https://aeturrell.github.io/coding-for-economists/vis-letsplot.html.

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/ao_data_vis.png" style="width:80%; float:left;" />

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from lets_plot import *

LetsPlot.setup_html() # Set up Lets-Plot for HTML output (useful in Jupyter notebooks)

## Motivation

As a data scientist, you will often face the challenge of understanding an unfamiliar dataset. Plots are a good way to build a first intuition. `lets-plot` lets you create a wide range of plots with just a few lines of code and an easy-to-understand syntax. Let's look at an example:

We will take a quick look at the __tips dataset__ from seaborn.

In [None]:
df = sns.load_dataset("tips")
df.head()

Imagine you want to find out whether tipping behaviour differs by a person's smoking status and the day of the week. We can easily do this with `lets-plot`:

In [None]:
(
    ggplot(df, aes(x="tip", y="total_bill", color="smoker"))
    + geom_point(size=3)
    + facet_wrap(["smoker", "day"])
)

This plot shows that higher total bills generally lead to higher tips across all groups. Additionally, smokers tend to have more variability in their tips, especially on weekends, compared to non-smokers who show a more consistent tipping pattern.

## Layers of a visualization
Every plot that you will create in Python can be broken down into the following seven layers:

<img src="https://datathon-ufrgs.github.io/Pintando_e_Bordando_no_R/images/camadas2.png" height=200>

The nice thing about lets-plot is that it allows you to control each of these layers separately and to simply stack them on top of each other. Here are some more details about the distinct layers:

### Data
* The dataset to be visualized
* In lets-plot, you typically pass a pandas DataFrame as the data source
* Grammar requires a tidy format (remember last lecture)

### Aesthetics
Defines the mapping of data variables to visual properties like x, y, color, fill, etc. This is done within the aes() function.

### Geometries
* Specifies the type of plot or layer, such as points, lines, bars, etc.
* Determines how the data will be visually represented

### Facet Mapping
Allows for creating small multiples or splitting the data into subplots based on one or more categorical variables. Panel layout may carry meaning.

### Statistics
* Even though data is tidy it may not represent the displayed values
* Transform input variables to displayed values:
   * Count number of observations in each category for a bar chart
   * Calculate summary statistics for a boxplot.
* Is implicit in many plot-types but can often be done prior to plotting

### Coordinates
Controls the scaling and transformation of the plot's coordinate system. For example, you can swap x and y axes, or set limits to zoom in on a specific area of the plot.

### Theme
* None of the previous layers deals with the visual look of the plot
* Theming spans every part of the graphic that is not linked to data
* Elements like the background, grid lines, and fonts can be specified

We will work with the Palmer Penguins dataset, which we load below (pandas and lets-plot were already imported above). We will start with the three key components that every plot is made of: __data, aesthetics and geoms.__

# Palmer Penguins Dataset

The Palmer Penguins dataset contains information about three penguin species observed in the Palmer Archipelago, Antarctica. Here are the main columns in the dataset:

| Column             | Description                                                                                   |
|--------------------|-----------------------------------------------------------------------------------------------|
| **species**        | The species of the penguin, which can be one of three types: Adelie, Gentoo, or Chinstrap.    |
| **island**         | The island in the Palmer Archipelago where the penguin was observed: Biscoe, Dream, or Torgersen. |
| **bill_length_mm** | The length of the penguin's bill ("Schnabel") in millimeters.                                       |
| **bill_depth_mm**  | The depth (height) of the penguin's bill in millimeters.                                      |
| **flipper_length_mm** | The length of the penguin's flipper ("Flosse") in millimeters.                                        |
| **body_mass_g**    | The body mass (weight) of the penguin in grams.                                               |
| **sex**            | The sex of the penguin, either male or female.                                                |


In [None]:
from palmerpenguins import load_penguins

penguins = load_penguins()
penguins.head()

## Basics

Every plot has three key components: __data, aesthetic mappings, and at least one layer__ (called a geom). Here's a simple example:

In [None]:
(
    ggplot(penguins, aes(x="body_mass_g", y="flipper_length_mm"))
    + geom_point()
)

We can play around with the look of the plot:

In [None]:
(
    ggplot(penguins, aes(x="body_mass_g", y="flipper_length_mm"))
    + geom_point(size=3, alpha=.5, shape=23, fill='green', color='blue', stroke=1.5)
)

You can find a short overview of the available parameters at https://lets-plot.org/python/pages/aesthetics.html.

Note that the data and the aesthetic mappings were passed to the function `ggplot`; layers (*geoms*) are then added with `+`. The pattern is similar for all **lets-plot** charts.

Note that `x` and `y` are the first two positional arguments of `aes`, so you can omit `x=` and `y=` like this:

In [None]:
(
  ggplot(penguins, aes("body_mass_g", "flipper_length_mm"))
  + geom_point()
)

## Adding extra dimensions: shape, colour, and size

In [None]:
(
  ggplot(penguins, aes("body_mass_g", "flipper_length_mm", colour="island"))
  + geom_point()
)

You can see that this has rendered the categorical variable "island" by having it appear in different colours. A legend has automatically been added. Do remember that not everyone can see all colours well, so it's best to use colourblind-friendly colour scales whenever possible.

Note that we can create the same plot by placing the __aesthetics inside `geom_point`__ since this is the only layer here. In the code above, the aesthetics were defined __globally__ for the entire plot.

In [None]:
(
  ggplot(penguins)
  + geom_point(aes("body_mass_g", "flipper_length_mm", colour="island"))
)

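This behaviour becomes clearer if we add another layer like `geom_line`. Compare the next two plots: in the first, the colour mapping is local to `geom_point`, so `geom_line` draws a single uncoloured line through all points. In the second, the global mapping applies to both layers, so you get one coloured line per island.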

In [None]:
(
  ggplot(penguins)
  + geom_point(aes("body_mass_g", "flipper_length_mm", colour="island"))
  + geom_line(aes("body_mass_g", "flipper_length_mm"))
)

In [None]:
(
  ggplot(penguins, aes("body_mass_g", "flipper_length_mm", colour="island"))
  + geom_point()
  + geom_line()
)

Let's look at shape too:

In [None]:
(
  ggplot(penguins, aes("body_mass_g", "flipper_length_mm", shape="island"))
  + geom_point()
)

Although we previously set the size of all points to a fixed value, we can use size as an aesthetic too:

In [None]:
(
  ggplot(penguins, aes("body_mass_g", "flipper_length_mm", size="island"))
  + geom_point(alpha=0.5)
)

### Facets

You can use facets (aka small multiples) to display more dimensions of information too. To facet your plot by a single variable, use `facet_wrap()`. The first argument of `facet_wrap()` tells the function what variable to have in successive charts. The variable that you pass to `facet_wrap()` should be categorical.

In [None]:
(
    ggplot(penguins, aes("body_mass_g", "flipper_length_mm"))
    + geom_point()
    + facet_wrap(facets="island", ncol=3)
)

## Exercise 1

(a) What is wrong with this code - can you fix it to make all points blue?

In [None]:
(
    ggplot(data=penguins)
    + geom_point(aes(x='body_mass_g', y='flipper_length_mm', color="blue"))
    + labs(title="Penguins Body Mass vs Flipper Length (Fixed Color)")
)

In [None]:
# your code here

(b) We want to create a plot that shows the `flipper_length_mm` (y-axis) based on the `body_mass_g` (x-axis). The size of the points should depend on `flipper_length_mm`. Fix this code.

In [None]:
(
    ggplot(data=penguins)
    + geom_point(aes(x='body_mass_g', y='flipper_length_mm'), size="flipper_length_mm")
)

In [None]:
# your code here

## Plot Geoms

We can substitute `geom_point()` for a different geom function in order to highlight different aspects of data. Here are some examples:

-   `geom_smooth()` fits a smoothed conditional line then plots it and its standard error.

-   `geom_boxplot()` produces a box-and-whisker plot to summarise the distribution of a set of points.

-   `geom_histogram()` and `geom_density()` show the distribution of continuous variables.

-   `geom_bar()` shows counts of categorical variables.

-   `geom_path()` and `geom_line()` draw lines between the data points.
    A line plot is constrained to produce lines that travel from left to right, while paths can go in any direction.
    Lines are typically used to explore how things change over time.

Let's take a closer look:

### Fitting a line

In [None]:
(
  ggplot(penguins, aes("body_mass_g", "flipper_length_mm"))
  + geom_point()
  + geom_smooth(method="loess")
)

You can use a linear model instead with `method="lm"` (this is the default).

### Jittered points and boxplots

These are especially useful when we have lots of data that overlap, or want to get more of an idea of the overall distribution, or both.

In [None]:
(
    ggplot(penguins, aes("island", "body_mass_g"))
    + geom_jitter()
)

Box plots are created via:

In [None]:
(
    ggplot(penguins, aes("island", "body_mass_g"))
    + geom_boxplot()
)

### Histograms and probability density plots

In [None]:
(
    ggplot(penguins, aes("body_mass_g"))
    + geom_histogram()
)

`geom_histogram()` has a `bins=` keyword argument that should be chosen carefully.

In [None]:
(
    ggplot(penguins, aes("body_mass_g"))
    + geom_density()
)

### Bar Charts



In [None]:
(
    ggplot(penguins, aes("species"))
    + geom_bar()
)

These are as you'd expect, but if you don't want a count of the number of items but just to __display the given values__, you can use the keyword argument `stat="identity"`.

In [None]:
# Sample data
data = pd.DataFrame({
    'category': ['A', 'B', 'C'],
    'value': [10, 20, 15]
})

# Create a bar plot using the actual values in the data
ggplot(data, aes(x='category', y='value')) + \
    geom_bar(stat='identity')

### Line charts and time series

Let's create a sample time series dataset that contains the temperature per day:

In [None]:
np.random.seed(0)
date_range = pd.date_range(start="2024-01-01", end="2024-01-31", freq='D')
temperature = 20 + np.random.normal(0, 2, len(date_range)).cumsum()  # Simulated temperature data

temperature_data = pd.DataFrame({'date': date_range, 'temperature': temperature})
temperature_data.head()

In [None]:
(
    ggplot(temperature_data, aes(x='date', y='temperature'))
    + geom_line(color='blue', size=1)
    + geom_point(color='blue', size=7, alpha=0.5)
)

## Labels and Titles

`xlab()` and `ylab()` modify the x- and y-axis labels:

In [None]:
(
  ggplot(penguins, aes("body_mass_g", "flipper_length_mm"))
  + geom_point()
  + xlab("Body mass (g)")
  + ylab("Flipper length (mm)")
)

But you can also specify all labels and titles at once:

In [None]:
(
    ggplot(penguins, aes(x="flipper_length_mm", y="body_mass_g"))
    + geom_point(aes(color="species", shape="species"))
    + geom_smooth(method="lm")
    + labs(
        title="Body mass and flipper length",
        subtitle="Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
        x="Flipper length (mm)",
        y="Body mass (g)",
        color="Species",
        shape="Species",
    )
)

## Limits on axes

When it comes to removing points from a plot, you always have two options: you can __filter your dataframe__ before plotting, or you can change the __limits on your axes__. For the latter, use the `xlim` and `ylim` commands.

In [None]:
(
    ggplot(penguins, aes(x="flipper_length_mm", y="body_mass_g"))
    + geom_point(size=4)
    + xlim(200, 230)
    + ylim(3e3, 5e3)
)

## Themes
`lets-plot` provides several built-in themes to customize the appearance of your plots.

In [None]:
themes = [
    ('Black and White', theme_bw()),
    ('Minimal', theme_minimal()),
    ('Minimal 2', theme_minimal2()),
    ('Light', theme_light()),
    ('Classic', theme_classic()),
    ('None', theme_none())
]

# Create the plot object
plot = ggplot(penguins) + geom_point(aes(x='bill_length_mm', y='bill_depth_mm', color='species'))

# Loop through each theme, apply it, and display the plot with a title
for theme_name, theme in themes:
    p = plot + theme + ggtitle(f"Theme: {theme_name}")
    p.show()

You can define __custom themes__ in lets-plot by modifying various plot elements such as the background, grid lines, text size, axis labels, etc. This is done by using the theme() function, which allows you to specify different theme elements to customize the appearance of your plot.

For more information on this you can have a look at this Notebook: https://nbviewer.org/github/JetBrains/lets-plot-docs/blob/master/source/kotlin_examples/cookbook/themes.ipynb

## Scales
In `lets-plot`, scales are used to map data to visual properties (aesthetics) such as color, size, shape, etc. Each aesthetic inside `aes()` gets assigned a scale, either by default or explicitly by the user. You can customize scales to adjust how data is mapped to aesthetics. For example, you can set __custom color palettes__, control the __range of values__ for size, or define __specific breaks__ for axes.

In [None]:
plot = ggplot(penguins) + geom_point(aes(x='bill_length_mm', y='bill_depth_mm', color='species', size='body_mass_g'))

custom_color_palette = plot + scale_color_discrete(palette='Set2')
custom_color_palette


For more options regarding `palette`, see [here](https://ggplot2-book.org/scales-colour#sec-colour-discrete). 

In [None]:
custom_scales = plot + scale_x_continuous(breaks=[35, 50, 59]) + scale_y_continuous(trans = 'reverse')
custom_scales

## Other useful-to-know elements of **lets-plot** charts

The [documentation](https://lets-plot.org/) of lets-plot is excellent and comprehensive, so you can find whatever you need there. But it may be useful to at least know of some further features we didn't look at in this lecture.

Here is a glimpse of what you can achieve with `lets-plot`:

Remark: This is just a screenshot of an interactive plot for which we will leave out the code. If you are interested in recreating it, feel free to follow along with https://www.kaggle.com/code/asmirnovhoris/bigquery-gis-and-lets-plot.

<img src="images/bike_plot.png" height=600>


The following code will give you a basic contour plot with contour lines representing the levels of the function $\sin(X^2 + Y^2)$:

In [None]:
x = np.linspace(-3.0, 3.0, 100)
y = np.linspace(-3.0, 3.0, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(X**2 + Y**2)

data = pd.DataFrame({
    'x': X.ravel(),
    'y': Y.ravel(),
    'z': Z.ravel()
})

plot = ggplot(data) + geom_contour(aes(x='x', y='y', z='z'), color='blue')

plot


## Exercise 2

Recreate this plot:

<img src='images/violin.png' style='width:30%; float:left;' />

In [None]:
# your code here

## Saving your plot to a file

Once you've made a plot, you might want to save it as an image that you can use elsewhere.
That's the job of `ggsave()`, which saves the plot you pass to it to disk:


In [None]:
plotted_data = (
    ggplot(penguins, aes(x="flipper_length_mm", y="body_mass_g")) + geom_point()
)
ggsave(plotted_data, filename="penguin-plot.svg")

This saved the figure to disk at the location shown. By default, it's in a subdirectory called "lets-plot-images".

We used the file format "svg", but there are __several output formats__ to choose from. Remember that, for graphics, __*vector formats*__ are generally better than *raster formats*. In practice, this means preferring __svg or pdf formats__ over jpg or png file formats. The svg format works in a lot of contexts (including Microsoft Word) and is a good default. To choose between formats, just supply the file extension and the file type will change automatically, e.g. "chart.svg" for svg or "chart.png" for png. You can also save figures in HTML format.

## Other plotting libraries

There are many plotting packages to choose from in Python, each with its own advantages. Among the most popular are:
- **Matplotlib:** A highly versatile library for creating static plots with extensive customization options for various chart types, from simple to complex.

- **Seaborn:** Built on top of Matplotlib, it simplifies the creation of attractive statistical plots with built-in themes and higher-level abstractions.

- **Plotly:** Focuses on interactive visualizations, ideal for web-based plots with features like zooming, real-time updates, and hover effects.

- **Altair:** A declarative visualization library for creating interactive and aesthetically pleasing plots with a simple syntax, suitable for statistical analysis.

If you want to see those libraries compared to each other directly, have a look at https://aeturrell.github.io/coding-for-economists/vis-common-plots.html#connected-scatter-plot.

## General advice on plotting

_Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink
in the smallest space._

Edward Tufte

Here are the main points to consider:

- Above all else, show the data
- Maximize the data-ink ratio
- Revise and edit

### Decision Tree for Visualization Choices

<img src="images/chart-types.png" style="width:80%; float:left;" />

## Next lecture: Data Acquisition

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/ao_data_acquisition.png" style="width:100%; float:left;" />

## Mentimeter