
# Data Visualization with ggplot2

## Introduction

John Tukey famously said, *“The simple graph has brought more information to the data analyst’s mind than any other device.”*  
This highlights the importance of data visualization in understanding and communicating patterns in data.

R provides multiple systems for creating graphics, but **ggplot2** is one of the most powerful and flexible. It is based on the **Grammar of Graphics**, a structured approach to building plots by combining components such as data, aesthetic mappings, and geometric objects.

Learning ggplot2 allows you to create a wide range of visualizations efficiently using a single, consistent framework.

In this chapter, we:
- Begin with a basic scatterplot
- Introduce **aesthetic mappings** and **geometric objects**
- Explore visualizations for single-variable distributions
- Examine relationships between two or more variables
- Conclude with saving plots and troubleshooting tips

## Prerequisites

This chapter relies heavily on **ggplot2**, which is part of the **tidyverse**—a collection of R packages commonly used in data analysis.

The tidyverse includes tools for:
- Data manipulation (`dplyr`, `tidyr`)
- Data visualization (`ggplot2`)
- Data import (`readr`)
- Working with strings, dates, and factors

Additional packages used in this chapter:
- **palmerpenguins**: Provides the `penguins` dataset with body measurements for penguins from the Palmer Archipelago
- **ggthemes**: Supplies additional themes and colorblind-friendly palettes for plots

Packages must be **installed once**, but **loaded every session** before use.
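
The install-once, load-every-session pattern can be sketched as follows. This is a minimal illustration rather than part of the original chapter: the names `pkgs`, `is_installed`, and `not_installed` are chosen here for the example, and the install step assumes a CRAN mirror is reachable.

```r
# Packages this chapter relies on
pkgs <- c("tidyverse", "palmerpenguins", "ggthemes")

# requireNamespace() returns FALSE (instead of raising an error) when a package is absent
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!is_installed]

# Install only what is missing; this is needed once per machine
if (length(not_installed) > 0) {
  install.packages(not_installed)
}
```

The `library()` calls still have to run in every new session, since installing a package does not load it.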


In [None]:
# If any of these packages is not installed, run this once first:
# install.packages(c("tidyverse", "palmerpenguins", "ggthemes"))

# Load the tidyverse (includes ggplot2 and other core packages)
library(tidyverse)

# Load the penguins dataset
library(palmerpenguins)

# Load additional plotting themes and color palettes
library(ggthemes)


## 1.2 First Steps

This section introduces ggplot2 by exploring the relationship between two numerical variables:
**flipper length** and **body mass** in penguins.

The main questions guiding the analysis are:
- Do penguins with longer flippers weigh more?
- Is the relationship positive or negative?
- Is it linear or nonlinear?
- Does the relationship vary by species or island?

Visualization is used to make these relationships precise and interpretable.

---

### 1.2.1 The `penguins` Data Frame

The `penguins` data frame (from the `palmerpenguins` package) contains **344 observations** and **8 variables**.  
Each row represents a single penguin, and each column represents a measured attribute.

Key definitions:
- **Variable**: A measurable attribute (e.g., flipper length)
- **Value**: A recorded measurement of a variable
- **Observation**: All measurements for one penguin
- **Tidy data**: Each variable in its own column, each observation in its own row, and each value in its own cell

The data is stored as a **tibble**, which is a tidyverse-friendly version of a data frame.

Important variables used in this section:
- `species`: Penguin species (Adelie, Chinstrap, Gentoo)
- `flipper_length_mm`: Flipper length in millimeters
- `body_mass_g`: Body mass in grams
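
A quick way to check these facts is to inspect the data frame with base R. The sketch below assumes `palmerpenguins` is installed; `key_vars` is just an illustrative name.

```r
library(palmerpenguins)

# Number of rows (observations) and columns (variables)
dim(penguins)

# The three variables used in this section
key_vars <- penguins[, c("species", "flipper_length_mm", "body_mass_g")]
head(key_vars)

# Missing values per column; incomplete rows are what trigger plotting warnings later
colSums(is.na(key_vars))
```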

---

### 1.2.2 Ultimate Goal

The goal is to create a visualization that shows:
- The relationship between flipper length and body mass
- Differences between species
- A single overall trend line
- Clear, accessible labeling

---

### 1.2.3 Creating a ggplot

A ggplot is built up in steps:
1. `ggplot()` initializes the plot and specifies the dataset
2. `aes()` maps variables to visual aesthetics (x, y, color, shape)
3. `geom_*()` functions add layers that define how the data is drawn

A scatterplot is created using `geom_point()`.

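When a row has a missing value for a plotted variable, ggplot2 removes it and issues a warning instead of dropping it silently. In `penguins`, two rows lack both flipper length and body mass, so the scatterplot warns that 2 rows were removed. This makes missing data issues explicit.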
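
There are two common ways to deal with the missing-value warning, sketched below. The variable names `n_missing` and `complete_rows` are illustrative; `na.rm` is a standard argument of ggplot2 geoms.

```r
library(ggplot2)
library(palmerpenguins)

# How many rows lack one of the plotted values?
n_missing <- sum(is.na(penguins$flipper_length_mm) | is.na(penguins$body_mass_g))
n_missing

# Option 1: remove the incomplete rows yourself, so the choice is visible in the code
complete_rows <- subset(penguins, !is.na(flipper_length_mm) & !is.na(body_mass_g))
ggplot(complete_rows, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Option 2: let the geom drop them without a warning
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)
```

Option 1 is usually easier to audit later, because the filtering step is part of the code rather than a side effect of plotting.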

---

### 1.2.4 Adding Aesthetics and Layers

Additional variables can be incorporated using aesthetics:
- Mapping `species` to `color` and `shape` reveals group differences
- ggplot2 automatically scales aesthetics and adds legends

Trend lines are added with `geom_smooth(method = "lm")`.

Important distinction:
- **Global aesthetics** (defined in `ggplot()`) apply to all layers
- **Local aesthetics** (defined inside a `geom`) apply only to that layer

This allows points to be colored by species while keeping a single regression line.
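
One way to see the global/local distinction is to look at where a plot object stores its mappings. This sketch relies on the `$mapping` and `$layers` fields of ggplot objects, which are internal details that may differ between ggplot2 versions, so treat it as an exploration aid only.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm")

# Global mappings, inherited by every layer
names(p$mapping)

# Local mappings of layer 1 (points); ggplot2 stores `color` as `colour`
names(p$layers[[1]]$mapping)

# Layer 2 (smooth) has no local mappings, so it only sees x and y
names(p$layers[[2]]$mapping)
```

Because the smoothing layer never sees `species`, it fits one line to all penguins instead of one line per species.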

---

### Accessibility and Labels

Good visualizations:
- Do not rely on color alone (use shapes as well)
- Include informative titles, subtitles, and axis labels
- Use colorblind-friendly palettes when possible

The `labs()` function improves clarity, and `scale_color_colorblind()` improves accessibility.

---

### Key Takeaways

- Scatterplots are ideal for relationships between two numerical variables
- Aesthetics map variables to visual properties such as position, color, and shape
- Geoms determine the geometric object (points, lines, bars) used to draw the data
- Layering allows complex plots to be built step-by-step


In [None]:
# Inspect the data
penguins
glimpse(penguins)

# Initialize an empty plot
ggplot(data = penguins)

# Map flipper length and body mass to axes
ggplot(
  data = penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
)

# Create a scatterplot
ggplot(
  data = penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

# Color points by species
ggplot(
  data = penguins,
  aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()

# Add linear trend lines by species
ggplot(
  data = penguins,
  aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

# Use local aesthetics to keep a single trend line
ggplot(
  data = penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm")

# Final polished visualization
ggplot(
  data = penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  ) +
  scale_color_colorblind()


## 1.3 ggplot2 Calls

As you become more comfortable with ggplot2, the code used to create plots becomes more concise.

Earlier examples spelled out argument names to make learning easier. Written in full, a `ggplot()` call names both of its first two arguments:
- `data` for the dataset
- `mapping` for aesthetic mappings created with `aes()`

The first two arguments of `ggplot()` are always `data` and `mapping`, so their names are usually omitted. This reduces typing and makes it easier to focus on what changes from one plot to another.

This concise style is preferred and is used throughout the rest of the book.

---

### Concise ggplot Syntax

Instead of explicitly naming arguments, ggplot calls are typically written using positional arguments. This produces the same plot while improving readability.

---

### Using the Pipe Operator

The native pipe operator (`|>`, available in R 4.1 and later) passes the data on its left as the first argument of `ggplot()`. This emphasizes a workflow style of programming:
- Start with a dataset
- Then apply plotting functions

This approach becomes especially useful when plots are created after data manipulation steps.
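
As a sketch of that workflow, the example below filters the data before plotting. The choice of species and the name `gentoo_plot` are illustrative; `filter()` comes from dplyr, which is loaded with the tidyverse.

```r
library(tidyverse)
library(palmerpenguins)

gentoo_plot <- penguins |>
  # Data manipulation step: keep one species and drop rows with missing measurements
  filter(species == "Gentoo", !is.na(flipper_length_mm), !is.na(body_mass_g)) |>
  # Plotting step: the filtered data frame becomes the first argument of ggplot()
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

gentoo_plot
```

Note the switch from `|>` to `+` once `ggplot()` is called: data steps are chained with the pipe, plot layers are added with `+`.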

---

### Key Takeaways

- The first two arguments of `ggplot()` are `data` and `mapping`
- Argument names are often omitted in practice
- Concise code is easier to read and compare
- The pipe operator integrates ggplot2 into tidy workflows


In [None]:
# Explicit version
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

# Concise version
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Using the pipe operator
penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

## 1.4 Visualizing Distributions

The way a distribution is visualized depends on the type of variable:
- **Categorical variables** are best summarized by counts.
- **Numerical variables** are best summarized by the shape of their distribution.

---

### 1.4.1 Visualizing a Categorical Variable

A variable is **categorical** if it can take only one of a limited number of values.  
To visualize the distribution of a categorical variable, a **bar chart** is commonly used.

In a bar chart:
- The x-axis shows the categories
- The height of each bar represents the number of observations in that category

For categorical variables with unordered levels, it is often helpful to **reorder bars by frequency**. This makes comparisons easier and highlights the most common categories.

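Reordering requires treating the variable as a factor and rearranging its levels. `fct_infreq()` from the forcats package (loaded with the tidyverse) orders the levels from most to least frequent.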
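
To see what frequency reordering does to the factor levels, compare the level order before and after. This is a small sketch using `fct_infreq()` from forcats on the `species` column.

```r
library(forcats)
library(palmerpenguins)

# Original level order (alphabetical)
levels(penguins$species)

# Observations per species
table(penguins$species)

# Levels reordered from most to least frequent
levels(fct_infreq(penguins$species))
```

The bar chart draws categories in level order, so changing the levels changes the order of the bars without altering the data.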

---

### 1.4.2 Visualizing a Numerical Variable

A variable is **numerical** if it can take a wide range of numeric values and arithmetic operations such as averaging are meaningful. Numerical variables can be continuous or discrete.

A common way to visualize the distribution of a numerical variable is a **histogram**.

Histograms:
- Divide the x-axis into equally spaced bins
- Count how many observations fall into each bin
- Depend heavily on the chosen bin width

Choosing an appropriate bin width is important:
- Too small: too many bars, noisy appearance
- Too large: too few bars, important structure hidden
- A moderate bin width often best reveals the shape of the distribution
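
A rough way to judge a bin width before plotting is to estimate how many bins it produces over the range of the data. The sketch below uses three candidate widths picked for illustration; the count ggplot2 actually draws can differ slightly depending on how bins are aligned.

```r
library(palmerpenguins)

# Range of the variable, ignoring missing values
mass_range <- diff(range(penguins$body_mass_g, na.rm = TRUE))

# Approximate number of bins each width produces across that range
binwidths <- c(20, 200, 2000)
approx_bins <- ceiling(mass_range / binwidths)

data.frame(binwidth = binwidths, approx_bins = approx_bins)
```

A width that yields hundreds of bins for a few hundred observations will look noisy, while one that yields two or three bins hides nearly all structure.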

---

### Density Plots

A **density plot** is a smoothed version of a histogram.
- It shows the overall shape of the distribution
- It is useful for identifying skewness and modes
- It sacrifices detail for smoothness

Density plots are especially useful for continuous data believed to come from an underlying smooth distribution.

Warnings may appear if observations contain missing values, as these cannot be plotted.
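
The trade-off between detail and smoothness can be explored with the `adjust` argument of `geom_density()`, which scales the default bandwidth. The value `0.5` below is an arbitrary choice for illustration, and the plot names are hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# Default bandwidth
p_default <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_density(na.rm = TRUE)

# Half the default bandwidth: less smoothing, more visible detail
p_detail <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_density(adjust = 0.5, na.rm = TRUE)

p_default
p_detail
```

Setting `na.rm = TRUE` drops the missing values without a warning.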

---

### Key Takeaways

- Bar charts visualize counts of categorical variables
- Histograms visualize distributions of numerical variables
- Bin width strongly affects histogram interpretation
- Density plots provide a smooth alternative to histograms


In [None]:
# Bar plot of a categorical variable
ggplot(penguins, aes(x = species)) +
  geom_bar()

# Reorder bars by frequency
ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar()

# Histogram with a reasonable bin width
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)

# Histogram with too small bin width
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 20)

# Histogram with too large bin width
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 2000)

# Density plot
ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()


## 1.5 Visualizing Relationships

To visualize a **relationship**, we must map **at least two variables** to aesthetics in a plot. Different combinations of variable types (categorical vs numerical) require different geoms.

---

### 1.5.1 One Numerical and One Categorical Variable

A common way to visualize the relationship between a **numerical** and a **categorical** variable is a **side-by-side boxplot**.

A boxplot summarizes a distribution using:
- The **interquartile range (IQR)**: the box spans the 25th to the 75th percentile
- The **median** (50th percentile): shown as a line inside the box
- **Whiskers**: extend from each edge of the box to the most extreme value within 1.5 × IQR of that edge
- **Outliers**: points more than 1.5 × IQR beyond either edge of the box, drawn individually

Boxplots allow comparison of distributions across categories and help identify skewness and outliers.
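
The quantities a boxplot displays can be computed by hand for one group. The sketch below does this for Adelie body mass using base R; the variable names are illustrative, and the default `quantile()` method is assumed to be close to what the plot draws.

```r
library(palmerpenguins)

# Body mass for one species, with missing values removed
adelie_mass <- penguins$body_mass_g[penguins$species == "Adelie"]
adelie_mass <- adelie_mass[!is.na(adelie_mass)]

# Box edges (25th and 75th percentiles) and median
q <- unname(quantile(adelie_mass, c(0.25, 0.50, 0.75)))
iqr <- q[3] - q[1]

# Fences: values outside them are drawn as individual outlier points
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the fences
inside <- adelie_mass[adelie_mass >= lower_fence & adelie_mass <= upper_fence]
c(lower_whisker = min(inside), q1 = q[1], median = q[2], q3 = q[3],
  upper_whisker = max(inside))
```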

---

### Density plots by category

Instead of boxplots, we can compare distributions using **density plots**, especially useful for continuous variables.  
Mapping a categorical variable to `color` (and optionally `fill`) allows multiple distributions to be overlaid.

- **Mapping** (inside `aes()`): the aesthetic varies with a variable in the data
- **Setting** (outside `aes()`): the aesthetic is fixed to a constant value

The `alpha` aesthetic controls transparency (0 = transparent, 1 = opaque).
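
The difference between mapping and setting is easiest to see side by side. In this sketch the color `"steelblue"` and the plot names are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping: inside aes(), color varies with the species column and gets a legend
p_mapped <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_density(aes(color = species), na.rm = TRUE)

# Setting: outside aes(), every curve gets the same fixed color, fill, and transparency
p_set <- ggplot(penguins, aes(x = body_mass_g, group = species)) +
  geom_density(color = "steelblue", fill = "steelblue", alpha = 0.3, na.rm = TRUE)

p_mapped
p_set
```

A common mistake is writing `aes(color = "steelblue")`, which maps the literal string as if it were a data value and produces a legend with a single, unrelated color.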

---

### 1.5.2 Two Categorical Variables

To visualize the relationship between **two categorical variables**, we commonly use **stacked bar plots**.

- Default stacked bars show **counts**
- Using `position = "fill"` shows **proportions**

Relative frequency plots are better for comparing category composition when group sizes differ.

The y-axis label can be overridden using `labs()`.
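
The numbers behind a `position = "fill"` bar plot can be reproduced with dplyr, which helps when reading exact proportions off the plot is difficult. This is a sketch; `species_share` and `proportion` are illustrative names.

```r
library(dplyr)
library(palmerpenguins)

species_share <- penguins |>
  count(island, species) |>           # segment heights under default stacking
  group_by(island) |>
  mutate(proportion = n / sum(n)) |>  # segment heights under position = "fill"
  ungroup()

species_share
```

Within each island the proportions sum to 1, which is why every bar in the filled plot has the same height.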

---

### 1.5.3 Two Numerical Variables

The standard visualization for two numerical variables is a **scatterplot**.

Scatterplots reveal:
- Direction (positive/negative association)
- Strength of relationship
- Clusters and outliers

---

### 1.5.4 Three or More Variables

Additional variables can be incorporated by:
- Mapping them to aesthetics (color, shape, size)
- Using **faceting** to create subplots

Faceting is especially effective for categorical variables and avoids cluttered plots.

`facet_wrap()` uses a formula (`~variable`) and splits the data into panels.

---

## 1.5.5 Exercises

### 1. mpg dataset: variable types
Categorical variables include:
- `manufacturer`, `model`, `trans`, `drv`, `fl`, `class`

Numerical variables include:
- `displ`, `year`, `cyl`, `cty`, `hwy`

You can see this by:
- Running `?mpg`
- Using `str(mpg)` to inspect variable types

---

### 2. Scatterplot aesthetics behavior
- **Color & size (numerical)**: produce continuous scales
- **Color & shape (categorical)**: produce discrete legends
- **Shape (numerical)**: raises an error, because a continuous variable cannot be mapped to the discrete set of point shapes
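
The behavior can be checked directly with the `mpg` data. In this sketch, `cty` is used as the numerical variable, and the error is captured with `tryCatch()` so the rest of the code keeps running; the exact wording of the message may vary between ggplot2 versions.

```r
library(ggplot2)

# Numerical variable mapped to color: a continuous gradient legend
ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

# Numerical variable mapped to shape: the error is raised when the plot is built
p_shape <- ggplot(mpg, aes(x = displ, y = hwy, shape = cty)) +
  geom_point()

shape_result <- tryCatch(
  {
    ggplot_build(p_shape)
    "built without error"
  },
  error = function(e) conditionMessage(e)
)
shape_result
```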

---

### 3. Mapping linewidth
Mapping a variable to `linewidth` in a scatterplot has little effect and is not recommended for points. Linewidth is better suited for line geoms.

---

### 4. Mapping the same variable to multiple aesthetics
This reinforces group distinctions but can also increase visual redundancy or clutter.

---

### 5. Coloring and faceting by species
Coloring reveals that each species forms a distinct cluster in bill measurements.  
Faceting further clarifies within-species relationships by separating the groups entirely.

---

### 6. Duplicate legends
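ggplot2 merges the legends for `color` and `shape` only when both map the same variable and share the same title. If just one of them is renamed with `labs()`, the titles differ and two legends appear.  
They can be combined by giving both aesthetics the same title in `labs()`.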

---

### 7. Interpreting stacked bar plots
- `x = island`: answers *“What is the species composition within each island?”*
- `x = species`: answers *“How are islands represented within each species?”*


In [None]:
# Boxplot: numerical vs categorical
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

# Density plots by species (color only)
ggplot(penguins, aes(x = body_mass_g, color = species)) +
  geom_density(linewidth = 0.75)

# Density plots with color, fill, and transparency
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(alpha = 0.5)

# Stacked bar plot (counts)
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()

# Stacked bar plot (proportions)
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")

# Scatterplot: two numerical variables
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Scatterplot with additional aesthetics
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island))

# Faceted scatterplot
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island)

# mpg scatterplot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# bill measurements colored by species
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point()

# Faceted by species
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~species)

# Fixing duplicate legends
ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm,
    y = bill_depth_mm,
    color = species,
    shape = species
  )
) +
  geom_point() +
  labs(color = "Species", shape = "Species")

# Stacked bar plots for comparison
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill")


## 1.6 Saving Your Plots

After creating a plot, you may want to save it as an image file so it can be used outside of R. This is done with the `ggsave()` function.

By default, `ggsave()` saves the **most recently displayed plot** to disk. If the file name contains no directory, the file is written to the **current working directory**.

If `width` and `height` are not specified, `ggsave()` uses the dimensions of the current plotting device. For **reproducible code**, it is best practice to always specify these values.
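
A more reproducible call stores the plot in a variable and passes it, along with explicit dimensions, to `ggsave()`. This is a sketch: the output path in `tempdir()` and the specific size and resolution are illustrative choices, while `plot`, `width`, `height`, `units`, and `dpi` are documented `ggsave()` arguments.

```r
library(ggplot2)
library(palmerpenguins)

# Keep the plot in a variable so the saved plot does not depend on what was drawn last
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Hypothetical output location for this example
out_file <- file.path(tempdir(), "penguin-plot.png")

ggsave(
  filename = out_file,
  plot = p,
  width = 6,
  height = 4,
  units = "in",
  dpi = 300
)

file.exists(out_file)
```

Passing `plot = p` explicitly means the result stays the same even if other plots are displayed between creating `p` and saving it.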

Although `ggsave()` is useful, final reports are typically created using **Quarto**, which allows code and text to be combined in a single, reproducible document with plots included automatically.

---

## 1.6.1 Exercises

### Which plot is saved?

The second plot is saved because, by default, `ggsave()` saves the **last plot that was displayed** (its `plot` argument defaults to `last_plot()`).

---

### Saving as a PDF

To save a plot as a PDF instead of a PNG, change the file extension to `.pdf`.

---

### Supported file types

To find which image formats are supported by `ggsave()`, consult the function documentation using `?ggsave`.


In [None]:
# Create a scatterplot
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Save the most recently created plot
ggsave(filename = "penguin-plot.png", width = 6, height = 4)

# Exercise example: two plots
ggplot(mpg, aes(x = class)) +
  geom_bar()

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# This saves the second plot
ggsave("mpg-plot.png")

# Saving as a PDF instead of PNG
ggsave("mpg-plot.pdf")

# View documentation for supported formats
?ggsave
