# **Data Visualization with Python using Plotnine**
**Instructor:** *Bianca Peterson*

## **Introduction**
Data visualization is a powerful tool for understanding data and communicating insights. In this tutorial, we will use plotnine, a Python library based on the grammar of graphics, to create various types of plots.
Python has mature plotting libraries such as Matplotlib, but here we focus on the plotnine package. Plotnine builds plots from structured data, drawing inspiration from ggplot2 in R and Leland Wilkinson's Grammar of Graphics. It is built on Matplotlib and works directly with pandas data frames, which makes it well suited to visual data exploration and analysis.

## **Overview**
In this tutorial we look at data on the wealth and life expectancy of countries over time, used by Hans Rosling and known as gapminder.
The goal is to provide an overview of how to graph a variable (data) depending on its type, introduce some simple 1D and 2D plots constructed using plotnine and provide an outline of the layered grammar of graphics.

## **Learning objectives**
 - Generate plots from data according to their type (discrete, continuous …)
 - Manage plot settings
 - Produce plots from data in a data frame
 - Modify and customize a plot
 - Create complex and fancy plots


### Installing and loading packages

In [None]:
# Loading/installing packages
# !pip install gapminder
import pandas as pd
import numpy as np
from plotnine import *
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Loading the dataset
from gapminder import gapminder

### Data Structure and Overview
Let's have a look at our data structure.


In [None]:
# Inspecting the data structure
print(gapminder.info())
print(gapminder.head())

It is useful to get an overview of the variables before getting started.

In [None]:
print(gapminder.describe(include='all'))

### Trends Over Time by Continent
We will want to look at trends over time by continent. How many countries are in this data set in each continent? There are 12 years for each country. Are the data complete?
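One possible way to answer these questions with pandas is sketched below. The `toy` data frame and its values are hypothetical stand-ins so that the snippet runs on its own; in the notebook the same calls would be applied to `gapminder`.

```python
import pandas as pd

# Hypothetical stand-in for gapminder: 3 countries, 2 years each
toy = pd.DataFrame({
    "country":   ["Country A", "Country A", "Country B", "Country B", "Country C", "Country C"],
    "continent": ["Africa", "Africa", "Asia", "Asia", "Asia", "Asia"],
    "year":      [2002, 2007] * 3,
})

# Frequency table: number of rows (country-year pairs) per continent
rows_per_continent = toy["continent"].value_counts()

# Number of distinct countries per continent
countries_per_continent = toy.groupby("continent")["country"].nunique()

# Completeness check: every country should have the same number of years
years_per_country = toy.groupby("country")["year"].nunique()
is_complete = years_per_country.nunique() == 1

print(rows_per_continent)
print(countries_per_continent)
print("Complete:", is_complete)
```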


In [None]:
# Checking the completeness of data - create a frequency table


## 1D Plots: Bar Plots for Discrete Variables
As we saw in the lecture, the distribution of a categorical variable is best visualized with a bar plot. Take `continent`, for example.
With `plotnine`, this is relatively easy.
- we start by mapping the `x` variable to `continent`
- then, we add a `geom_bar()` layer, that counts the observations in each category and plots them as bar lengths.

The `aes()` function maps columns of data onto graphical attributes such as colors, shapes, or x and y coordinates. (The name `aes()` is short for aesthetic.)


In [None]:
# Bar plot for continents


To make this more colorful, you can also map the `fill` attribute to `continent`.

In [None]:
# Bar plot for continent with fill color


With `plotnine` features, we are also able to:
- change the default color schemes
- modify labels
- change the legend position, or eliminate it in some cases
- flip the axes …

Let's try some!

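- We will change the y value computed by `geom_bar()` from `count` to `count / 12`, so that each bar shows the number of countries (every country has 12 rows, one per year). Recent plotnine versions write this as `after_stat('count / 12')`; `..count../12` is the older ggplot2-style notation.
- Change the label of the y axis to a more meaningful one: `Number of countries`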
- Suppress the default legend for continent, which is redundant in this case


In [None]:
# Customizing the bar plot


### Transforming Coordinates and Flipping Axes


Transforming coordinates using the `coord_trans` function:

In [None]:
# Transforming coordinates using coord_trans function


Flipping axes using the `coord_flip` function:

In [None]:
# Flipping axes using coord_flip function


## Saving plots

## 1D Plots: Density Plots for Continuous Variables
The `gapminder` data set contains several continuous variables: life expectancy (`lifeExp`), population (`pop`) and gross domestic product per capita (`gdpPercap`) for each year and country.
For such variables, density plots provide a useful graphical summary.
Let’s start by exploring life expectancy. The simplest plot uses this as the horizontal axis, `aes(x=lifeExp)` and then adds `geom_density()` to calculate and plot the smoothed frequency distribution.

In [None]:
# Density plot for life expectancy


We can make this plot prettier by changing the line thickness (`size`), adding a fill color (`fill`), and making the fill color partially transparent (`alpha`).

In [None]:
# Prettifying the density plot


### Differences by Continent
The distribution of life expectancy is bimodal (two peaks), reflecting the difference between developing countries (left peak) and developed countries (right peak). To see how the continents differ, we add another aesthetic attribute, `fill='continent'`, which `geom_density()` inherits.

In [None]:
# Density plot for life expectancy by continent


**Note 1:** We used transparent colors (`alpha`) to see the different distributions across each continent more clearly.

**Note 2:** It is easy now to see that African countries differ markedly from the rest.

## Boxplots and Other Visual Summaries
You might want to visualize the distributions of life expectancy by another visual summary, grouped by `continent`. All you need to do is change the aesthetic to show `continent` on one axis, and life expectancy (`lifeExp`) on the other.


In [None]:
# Boxplot for life expectancy by continent


### <span style="color:red">**CHALLENGE 1**</span>
1. Remove the legend from this plot
2. Make the plot horizontal
3. Instead of a boxplot, try `geom_violin()`


## Effect Ordering
The continents are a categorical variable (a factor, in R terms) and are ordered alphabetically by default. It might be more useful to order them by the mean or median life expectancy.


In [None]:
# Reordering continents by median life expectancy

# Convert 'continent' to a categorical type and reorder by median life expectancy
median_life_exp = gapminder.groupby('continent', observed=False)['lifeExp'].median()

gapminder['continent'] = pd.Categorical(gapminder['continent'], categories=median_life_exp.sort_values().index, ordered=True)
# .index extracts the continent names in the correct sorted order

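# Create a boxplot with reordered continents
reordered_boxplot = (ggplot(gapminder, aes(x='continent', y='lifeExp', fill='continent'))
                     + geom_boxplot())

# Show the plot
reordered_boxplot.show()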


### Exploring GDP per Capita
Let’s look at the distribution of `gdpPercap` in a similar way, starting with the unconditional distribution.

In [None]:
# Plotting the distribution of GDP per capita


### <span style="color:red">**CHALLENGE 2**</span>

1. Plot the distributions of GDP per capita separately for each continent.
2. Transform the x-axis to a log scale and add another layer for the log transformation.
3. Create boxplots of GDP per capita by continent, with and without a log scale.


## Faceting

Sometimes it might make sense to split a single plot into several smaller plots. We can do that by faceting the plot.

### Facet by 1 variable

In [None]:
# Plotting the distribution of GDP per capita for each continent


### Facet by 2 variables

In [None]:
# Plotting the distribution of GDP per capita for each continent per year


## Layers & Time series plots
### Layers
Let's explore how life expectancy changes with GDP per capita for a single country, for example China. We can use `geom_line` to make a line plot.

In [None]:
# Line plot for China showing life expectancy versus GDP
# Filter data for China
china = gapminder[gapminder['country'] == 'China']


#### Adding Points

We can use both `geom_line` and `geom_point` to make a line plot with points at the data values.


**Note:** This brings up another important concept with ggplot: layers. A given plot can have multiple layers of geometric objects, plotted one on top of the other.

#### Adding Colors

Adding colors to the lines and points to enhance visibility.


If we switch the order of `geom_point()` and `geom_line()`, we’ll reverse the layers.

**Note:** Aesthetics included in the call to `ggplot()` (or added separately with `+ aes(...)`) become the defaults for all layers, but we can also control the aesthetics of each layer separately. For example, we could color the points by year:

#### Color Points by Year

Coloring the points by year to show additional information.


With a rainbow:

In [None]:
# With a rainbow color scale
import matplotlib.pyplot as plt
china_plot_colored_by_year_rainbow = (ggplot(china, aes(x='gdpPercap', y='lifeExp'))
                                      + geom_line()
                                      + geom_point(aes(color='year'))
                                      + scale_color_gradientn(colors=plt.cm.rainbow(np.linspace(0, 1, 5))))
# np.linspace creates 5 evenly spaced numbers between 0 and 1, which represents the positions along the color map
# plt.cm.rainbow samples 5 colors evenly distributed across the rainbow spectrum
# scale_color_gradientn creates a continuous color scale

china_plot_colored_by_year_rainbow.show()

Coloring both points and lines:

In [None]:
# Coloring both points and lines with a rainbow color scale
import matplotlib.pyplot as plt

china_plot_colored_by_year_rainbow = (ggplot(china, aes(x='gdpPercap', y='lifeExp'))
                                      + geom_line()
                                      + geom_point()
                                      + scale_color_gradientn(colors=plt.cm.rainbow(np.linspace(0, 1, 5)))
                                      + aes(color='year'))

china_plot_colored_by_year_rainbow.show()

### <span style="color:red">**CHALLENGE 3**</span>

Make a plot of life expectancy vs. GDP per capita for China and India, with both lines and points. Make a separate line for each country.

In [None]:
# Filter data for China and India


## Time Series Plot

Exploring how life expectancy has changed over time. We use the `group` aesthetic to create a line for each country.


#### Adding Colors

Adding color to the lines based on continent.


#### Changing Color Shade

Changing the color shade (make it more transparent) to make the plot more readable.


### Plotting a Summary

A better way to look at trends over time is to find the mean or median for each `year` and `continent` and plot those.


In [None]:
summary = (gapminder.groupby(['continent', 'year'], observed=False)
           .agg({'lifeExp': 'median'})
           .reset_index())



Let’s play with our plot and make it fancier!

We can fit linear regression lines for each `continent` instead of joining all the points:

In [None]:
# Linear regression


We can also use a `loess` smooth rather than a linear regression:

In [None]:
# Loess smooth
# !pip install scikit-misc


We can also override the default legend placement and put the legend inside the plot:

In [None]:
# Changing legend position


## Scatterplots

Let’s explore the relationship between life expectancy and GDP with scatterplots.

A basic scatterplot is set up by assigning two continuous variables to the `x` and `y` aesthetic attributes; we then add the points in another layer.


Or, color them by continent.

In [None]:
# Color by continent


We can also add smoothed curves, either one per continent or a single one for all the data:

In [None]:
# Add smoothed curve for each continent


In [None]:
# One smoothed curve for all continents


As we saw earlier, GDP per capita is better plotted on a log scale:

In [None]:
# Log scale for each continent


In [None]:
# Log scale for all continents


### Customizing the Plot

Adjusting scale labels, legend position, and theme.

The last plot, on the log scale, has ugly labels written in scientific notation. Let’s adjust the scale:

In [None]:
from plotnine import scale_x_log10, theme_bw

# Adjusting scale labels


Moving the legends inside the plot:

In [None]:
# Placing legend inside the plot


Changing the theme:

What happened to our legend? How can we fix it?

Replacing the single loess smoothed curve with a separate regression line for each continent:

In [None]:
# Smoothing with a regression line for each continent


Let's make a “bubble” plot by mapping the `size` of each point to population (`pop`). We'll keep the legend outside the plot for this example.

Changing color transparency:

### Exploring Life Expectancy by Continent for a Given Year

Filtering data to show life expectancy for the year 2007.


In [None]:
# Filter data for 2007
gm_2007 = gapminder[gapminder['year'] == 2007]

# Plot life expectancy by continent
gm_2007_plot = (ggplot(gm_2007, aes(y='lifeExp', x='continent'))
                + geom_point())
gm_2007_plot.show()

In [None]:
# Changing scale by jittering
# Plot with jitter


## Advanced Customized and Fancy Plot

Exploring GDP versus life expectancy in 2007, highlighting the larger countries by labeling a filtered subset of the data.

In [None]:
# Advanced customized bubble plot

advanced_bubble_plot = (
    ggplot(gm_2007)
    + geom_point(aes(x='gdpPercap', y='lifeExp', color='continent', size='pop'), alpha=0.5)
    + geom_text(aes(x='gdpPercap', y='lifeExp', label='country'), color='gray', 
                data=gm_2007[(gm_2007['pop'] > 1000000000) | (gm_2007['country'].isin(['Nigeria', 'United States']))])
    + scale_x_log10(limits=(200, 60000))
    + labs(title='GDP versus life expectancy in 2007',
           x='GDP per capita (log scale)',
           y='Life expectancy',
           size='Population',
           color='Continent')
    + scale_size(range=[0.1, 10])
    + guides(size='none') # Remove the legend for the size variable
    + theme_classic()
    + theme(legend_position='top',
            axis_line=element_line(color='#D3D3D3'),
            axis_ticks=element_line(color='#D3D3D3'))
    + guides(color='legend') # Add a legend for the color variable
)

advanced_bubble_plot.show()