
# 🎨 **9.1 Introduction**

In **Chapter 1**, you learned much more than just how to make **scatterplots**, **bar charts**, and **boxplots**. You built a **foundation** that allows you to create *any* type of plot using **ggplot2**.

In this chapter, you’ll expand that foundation by learning about the **layered grammar of graphics**, which is the core idea behind how ggplot2 works.

---

## 🧱 What You’ll Learn in This Chapter

You’ll take a deeper dive into the key building blocks of ggplot2:

- **Aesthetic mappings**: how variables are mapped to visual properties  
- **Geometric objects (geoms)**: the shapes that appear on your plot  
- **Facets**: how to split data into multiple panels  

You’ll also learn about what happens *behind the scenes*:

- **Statistical transformations**  
  - Used to compute new values, such as:
    - Bar heights in a **bar plot**
    - Medians in a **box plot**
- **Position adjustments**  
  - Control how geoms are arranged and displayed  
- **Coordinate systems**  
  - Define how data values are mapped onto the plot space  

---

## 🔍 Scope of the Chapter

This chapter won’t cover **every possible option** for each layer. Instead, it focuses on:

- The **most important and commonly used features** of ggplot2  
- Core ideas that will let you confidently explore the rest on your own  
- A brief introduction to **extension packages** that build on ggplot2  

By the end, you’ll understand *how* ggplot2 thinks, not just *how* to use it.

---

## 📦 **9.1.1 Prerequisites**

This chapter focuses entirely on **ggplot2**. To access the datasets, help pages, and functions used throughout the chapter, load the **tidyverse**.


In [None]:
library(tidyverse)


# 🎨 **9.2 Aesthetic Mappings**

> **“The greatest value of a picture is when it forces us to notice what we never expected to see.”**  
> – *John Tukey*

The **`mpg`** dataset bundled with **ggplot2** contains **234 observations** on **38 car models**. It’s a go-to dataset for learning how aesthetics work in **ggplot2**.

---

## 🚗 The `mpg` Dataset at a Glance

Some key variables we’ll use:

- **`displ`**: Engine size (liters), *numerical*
- **`hwy`**: Highway fuel efficiency (mpg), *numerical*
- **`class`**: Type of car, *categorical*

Our goal is to visualize the relationship between **engine size** and **fuel efficiency**, while using **aesthetic mappings** to represent additional information.

---

## 🖌️ What Are Aesthetic Mappings?

**Aesthetic mappings** describe how variables in your data are translated into visual properties of a plot, such as:

- **x** and **y** position  
- **color**
- **shape**
- **size**
- **alpha** (transparency)

Mappings are defined inside `aes()` and tell ggplot2 *what the data represents*, not just how the plot looks.

---

## 🎯 Mapping Data to Aesthetics

When you map:
- **Numerical variables** → `x` and `y`, you get axes
- **Categorical variables** → `color` or `shape`, you get legends

ggplot2 automatically:
- Chooses appropriate scales
- Builds legends
- Labels axes

No extra work needed.

---

## ⚠️ Common Warnings (and Why They Happen)

### Shape
- By default, ggplot2 uses **at most 6 distinct shapes** for a discrete variable
- If you map a variable with more levels, ggplot2 warns you and leaves the extra groups unplotted
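
If you do need one shape per level, one option is to supply the shapes yourself. The sketch below assumes `scale_shape_manual()` with seven arbitrarily chosen shape codes, one for each level of `class`; the particular codes are an illustration, not a recommendation.

```r
library(ggplot2)

# mpg$class has 7 levels, one more than the default shape palette provides,
# so we supply seven shape codes by hand (the codes chosen here are arbitrary).
p_shapes <- ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point() +
  scale_shape_manual(values = c(0, 1, 2, 3, 4, 5, 6))

p_shapes
```

With seven shapes supplied, every class is drawn and no warning should appear, though seven shapes on one plot can still be hard to tell apart.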

### Size & Alpha
- These aesthetics imply **ordering**
- Mapping an unordered categorical variable (like `class`) can be misleading
- ggplot2 warns you for a reason, so heed the warning

---

## 🎨 Mapping vs. Setting Aesthetics

There’s an important distinction:

- **Mapping** (inside `aes()`) → shows information
- **Setting** (outside `aes()`) → just changes appearance

Example:
- Mapping `color = class` explains something
- Setting `color = "blue"` is purely decorative
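
A short side-by-side sketch may make the distinction concrete. The object names (`p_mapped`, `p_set`, `p_slip`) are made up for this illustration; the third plot shows a common slip, where a color name is placed inside `aes()` by mistake.

```r
library(ggplot2)

# Mapping: color varies with the data and gets a legend.
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class))

# Setting: every point is blue and no legend is drawn.
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# A common slip: "blue" inside aes() is treated as a one-level variable,
# so the points get a default palette color and a legend entry named "blue".
p_slip <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))
```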

---

## 🔺 Shapes, Fill, and Stroke

R has **26 built-in shapes**, each behaving differently with:
- **color** (outline)
- **fill** (interior)
- **stroke** (outline thickness)

Only shapes **21–25** support both `fill` and `color`.
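
As a small sketch of how the three properties combine on a filled shape, the example below sets them by hand on shape 21 (a filled circle). The colors, size, and stroke width are arbitrary values picked for illustration.

```r
library(ggplot2)

# Shape 21 is a filled circle: `color` draws the border, `fill` the interior,
# and `stroke` sets the border width. The specific values are arbitrary.
p_filled <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21, color = "black", fill = "pink", size = 3, stroke = 1.5)

p_filled
```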

---

## 🧠 Key Takeaways

- Use aesthetics to **encode data**, not decoration
- Avoid mapping unordered categorical variables to ordered aesthetics such as `size` and `alpha`
- Let ggplot2 handle scales and legends
- Choose geoms and aesthetics that make your story clear

In the next section, we’ll dig deeper into **geometric objects (geoms)** and how they control what actually appears on the plot.


In [None]:
library(tidyverse)

# Color mapped to class
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Shape mapped to class (will trigger warnings)
ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point()

# Size mapped to class (not advised)
ggplot(mpg, aes(x = displ, y = hwy, size = class)) +
  geom_point()

# Alpha mapped to class (not advised)
ggplot(mpg, aes(x = displ, y = hwy, alpha = class)) +
  geom_point()

# Manually setting aesthetics: pink filled triangles
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 24, fill = "pink", color = "pink")


# 📐 **9.3 Geometric Objects**

Geometric objects, or **geoms**, determine *how* data are displayed in a ggplot. While aesthetics describe **what** variables are mapped to visual properties, geoms define **what shape** those mappings take.

Two plots can visualize the **same data** with the **same variables**, yet look very different if they use different geoms. For example, a scatterplot and a smooth curve may both show the relationship between engine size and highway mileage, but they emphasize different features of the data.

---

## 🔧 Changing the Geom Changes the Story

- **`geom_point()`** shows individual observations
- **`geom_smooth()`** summarizes trends by fitting a model to the data

Each geom:
- Accepts aesthetic mappings
- Ignores aesthetics it doesn’t know how to use
- May represent **many rows of data with a single object**
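
The last point is easiest to see by comparing row counts. As a rough sketch, the code below fits a straight line with `geom_smooth(method = "lm")` (the linear model is an assumption made for illustration; the default smoother is different) and uses `layer_data()` to look at what each layer actually draws.

```r
library(ggplot2)

p_lm <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 1 (points) keeps one row per car; layer 2 (the fitted line) is a
# single object drawn from a much smaller grid of fitted values.
nrow(layer_data(p_lm, 1))
nrow(layer_data(p_lm, 2))
```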

---

## 🧩 Grouping and Discrete Variables

Some geoms (like `geom_smooth()`) draw one object per group. When you map a **categorical variable** to an aesthetic such as `color` or `linetype`, ggplot2 automatically groups the data and draws multiple objects, one for each category.

If you want grouping *without* adding visual distinction or a legend, you can use the **`group`** aesthetic directly.

---

## 🧱 Layering Multiple Geoms

A single plot can contain multiple geoms, each:
- Using different aesthetics
- Using different data
- Highlighting different subsets of observations

Local mappings and data inside a geom override the global settings for that layer only, making layered visualizations both powerful and flexible.

---

## 📊 Choosing the Right Geom

Different geoms reveal different features of the same variable:

- **Histogram / Density plot** → distribution shape
- **Boxplot** → spread and outliers
- **Smooth curve** → overall trend

No single geom is “best”; the right choice depends on the question you’re asking.

---

## 🧠 Key Takeaways

- Geoms are the **core building blocks** of ggplot2
- The same data can tell very different stories with different geoms
- Grouping happens automatically when mapping discrete aesthetics
- Layering geoms lets you combine raw data with summaries
- ggplot2 can be extended with new geoms via external packages

Mastering geoms means mastering how your data speaks visually.


In [None]:
library(tidyverse)
library(ggridges)

# Point vs smooth
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth()

# Aesthetic compatibility: shape ignored, linetype works
ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_smooth()

ggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) +
  geom_smooth()

# Overlaying geoms with color and linetype
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(aes(linetype = drv))

# Grouping without legend
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(aes(group = drv))

# Local aesthetics and layers
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()

# Highlighting a subset with separate data
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_point(
    data = mpg |> filter(class == "2seater"),
    color = "red"
  ) +
  geom_point(
    data = mpg |> filter(class == "2seater"),
    shape = "circle open",
    size = 3,
    color = "red"
  )

# Different geoms for one variable
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

ggplot(mpg, aes(x = hwy)) +
  geom_density()

ggplot(mpg, aes(x = hwy)) +
  geom_boxplot()

# Extension geom: ridgeline plot
ggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) +
  geom_density_ridges(alpha = 0.5, show.legend = FALSE)


# 🧩 **9.4 Facets**

**Faceting** is a powerful technique in ggplot2 that lets you split a single plot into multiple subplots, each showing a different subset of the data. This is especially useful when you want to compare patterns across levels of one or more categorical variables.

---

## 🪟 Faceting with One Variable

You can split a plot by a single categorical variable using **`facet_wrap()`**. Each level of the variable gets its own panel.

- Panels are arranged automatically
- Best for one faceting variable
- Layout can be customized with rows or columns

---

## 🧮 Faceting with Two Variables

To facet by the **combination of two variables**, use **`facet_grid()`**.  
This creates a matrix of plots defined by **rows ~ columns**.

- Rows represent one variable
- Columns represent another
- Empty panels indicate combinations that don’t exist in the data
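
One way to anticipate which panels will be empty is to cross-tabulate the two faceting variables before plotting. The sketch below uses base R's `table()` on `drv` and `cyl`; the name `combos` is just a label chosen for this example.

```r
library(ggplot2)

# Cross-tabulate the two faceting variables; a zero marks an empty panel
# in facet_grid(drv ~ cyl).
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# Number of drv/cyl combinations with no cars at all.
sum(combos == 0)
```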

---

## 📏 Controlling Scales

By default, all facets share the same x and y scales, making comparisons easier across panels.  
However, this can sometimes hide patterns within individual facets.

You can change this behavior using the `scales` argument:

- `"free_x"` → x scales are allowed to differ between panels
- `"free_y"` → y scales are allowed to differ between panels
- `"free"` → both axes vary
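
For instance, freeing only the y axis might look like the sketch below. It uses `facet_wrap()` rather than `facet_grid()` purely for illustration, and it peeks at the built plot's layout table to check which scales each panel uses; treat that inspection step as an informal check rather than a documented workflow.

```r
library(ggplot2)

p_free_y <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv, scales = "free_y")

# The panel layout records which scale each panel uses: with "free_y",
# every panel gets its own y scale while all panels share one x scale.
panel_layout <- ggplot_build(p_free_y)$layout$layout
panel_layout[, c("PANEL", "SCALE_X", "SCALE_Y")]
```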

---

## 🧠 Why Use Facets?

**Advantages**
- Clear comparisons across groups
- Avoids overplotting and clutter
- No reliance on color or shape distinctions

**Disadvantages**
- Takes up more space
- Harder to compare exact values across many panels
- Less effective with too many facet levels

As datasets grow larger or more complex, faceting often becomes more effective than mapping groups to aesthetics like color.

---

## 🧭 Rows vs Columns Matter

Whether you facet across **rows** or **columns** affects how easily viewers compare distributions or trends.  
Choosing the right orientation can dramatically improve readability and insight.

Faceting isn’t just about splitting plots; it’s about structuring comparisons thoughtfully.


In [None]:
library(tidyverse)

# Facet by one variable
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ cyl)

# Facet by two variables
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

# Free scales
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl, scales = "free")

# Understanding empty cells
ggplot(mpg) +
  geom_point(aes(x = drv, y = cyl))

# Using . in facet_grid
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

# facet_wrap layout control
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~ cyl, nrow = 2)

# Comparing histograms across facets
ggplot(mpg, aes(x = displ)) +
  geom_histogram() +
  facet_grid(drv ~ .)

ggplot(mpg, aes(x = displ)) +
  geom_histogram() +
  facet_grid(. ~ drv)

# Recreating facet_grid(drv ~ .) with facet_wrap:
# nrow = 3 stacks one panel per drv level, matching the grid's row layout
ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~ drv, nrow = 3)


# 📊 **9.5 Statistical Transformations**

Not every plot displays raw values directly from a dataset. Some plots, like **bar charts**, **histograms**, **boxplots**, and **smooth curves**, first **compute new values** before drawing anything. These computations are called **statistical transformations**, or **stats**.

---

## 🔢 Where Do the Numbers Come From?

Consider a basic bar chart showing the number of diamonds by cut. The x-axis uses `cut`, which exists in the dataset, but the y-axis shows **counts**, which are *not* stored anywhere in the data. Instead, ggplot2 **calculates** them for you.

Different plot types rely on different statistical transformations:

- **Bar charts / histograms**  
  → Bin data and count observations  
- **Smoothers**  
  → Fit a model and plot predictions  
- **Boxplots**  
  → Compute the five-number summary  

This process always follows the same idea:

> **Raw data → statistical transformation → visual mapping**
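
To see the middle step of that pipeline, you can ask ggplot2 for the data a layer actually draws. The sketch below uses `layer_data()` on a bar chart of `cut` and compares the computed counts with a hand-made tally; `p_bar` and `bar_data` are names invented for the example.

```r
library(ggplot2)

p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# layer_data() returns the data frame a layer draws from, after the stat ran.
bar_data <- layer_data(p_bar)
bar_data[, c("x", "count", "prop")]

# The same counts computed by hand from the raw data.
table(diamonds$cut)
```

The `count` column is the value ggplot2 maps to bar height; it appears nowhere in `diamonds` itself.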

---

## 🔁 Geoms and Stats Work in Pairs

Every **geom** has a **default stat**, and every **stat** has a **default geom**.  
Because of this pairing, you usually don’t need to think about stats explicitly, but sometimes you should.
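
As a quick illustration of the pairing, the two plots sketched below should look the same: one starts from the geom and relies on its default stat, the other starts from the stat and relies on its default geom. The object names are made up for the example.

```r
library(ggplot2)

# geom_bar() uses stat_count() by default...
p_geom <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# ...and stat_count() uses a bar geom by default.
p_stat <- ggplot(diamonds, aes(x = cut)) +
  stat_count()

# Both layers end up drawing from the same computed counts.
all.equal(layer_data(p_geom)$count, layer_data(p_stat)$count)
```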

---

## 🛠️ When to Use a Stat Explicitly

You might specify a stat directly when:

1. **Overriding the default stat**  
   Example: using precomputed values instead of counts

2. **Changing how computed values map to aesthetics**  
   Example: plotting proportions instead of raw counts

3. **Making the transformation explicit for clarity**  
   Example: clearly showing a summary statistic being computed

---

## 📈 Computed Variables

Some stats generate new variables like `count`, `prop`, or model predictions.  
You can access these using `after_stat()`.
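
Computed variables are not limited to bar charts. As one more sketch, a histogram's stat also computes a `density` variable, and mapping it to `y` rescales the bars so their total area is 1. The bin width of 2 is an arbitrary choice for this example.

```r
library(ggplot2)

# after_stat(density) replaces the default count on the y axis.
p_density <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2)

hist_data <- layer_data(p_density)

# Bar height times bar width, summed over all bins, gives the total area.
sum(hist_data$density * (hist_data$xmax - hist_data$xmin))
```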


In [None]:
library(tidyverse)

# Default bar chart (uses stat_count)
ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# Using precomputed counts with stat = "identity"
diamonds |>
  count(cut) |>
  ggplot(aes(x = cut, y = n)) +
  geom_bar(stat = "identity")

# Bar chart of proportions
ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()

# Explicit statistical summary
ggplot(diamonds) +
  stat_summary(
    aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

# Problematic proportion mappings
ggplot(diamonds, aes(x = cut, y = after_stat(prop))) +
  geom_bar()

ggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop))) +
  geom_bar()


# 📐 **9.6 Position Adjustments**

Position adjustments control **how geoms are placed relative to each other** when multiple observations would otherwise overlap. They are especially important for **bar charts** and **scatterplots**, where overlapping elements can hide patterns in the data.

---

## 🟦 Coloring Bars: `color` vs `fill`

- `color` affects the **outline** of bars  
- `fill` affects the **interior**, and is usually more informative  

When `fill` is mapped to a second variable, bars are **stacked automatically**. Each colored segment represents a combination of the two variables.

---

## 🔄 Common Position Adjustments

ggplot2 handles overlaps using the `position` argument:

### 🔹 `position = "stack"` (default for bars)
- Stacks values on top of each other
- Good for showing totals

### 🔹 `position = "fill"`
- Like stacking, but rescales bars to the same height
- Best for comparing **proportions**

### 🔹 `position = "dodge"`
- Places bars **side by side**
- Best for comparing individual group values
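
The string `"dodge"` is shorthand for the function `position_dodge()`, and the function form exposes extra options. As a sketch, the example below uses `preserve = "single"` so that every bar keeps the same width even when some `drv` groups contain fewer classes than others; whether that looks better is a judgment call.

```r
library(ggplot2)

# String shorthand: bars in each drv group share that group's full width,
# so bar widths differ when groups contain different numbers of classes.
p_dodge <- ggplot(mpg, aes(x = drv, fill = class)) +
  geom_bar(position = "dodge")

# Function form with preserve = "single": every bar gets the same width.
p_dodge_single <- ggplot(mpg, aes(x = drv, fill = class)) +
  geom_bar(position = position_dodge(preserve = "single"))
```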

### 🔹 `position = "identity"`
- Draws objects exactly where they fall
- Often causes overlap (use transparency or outlines)

---

## 🎯 Overplotting and Jittering

Scatterplots can suffer from **overplotting** when many points share the same x and y values. This hides the true distribution.

- `position = "jitter"` adds small random noise
- Reveals density patterns without changing the overall structure
- `geom_jitter()` is a shortcut for this adjustment

Although jitter adds randomness, it often makes large-scale patterns **much clearer**.
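
Because the noise is random, a jittered plot changes slightly every time it is drawn. If you want repeatable output, one option is the function form `position_jitter()`, which accepts a `seed`. The width, height, and seed values below are arbitrary choices for this sketch.

```r
library(ggplot2)

# width/height control how far points may move; seed makes the noise repeatable.
p_jitter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.5, seed = 1))

# Building the plot twice yields the same jittered positions.
jitter_a <- layer_data(p_jitter)
jitter_b <- layer_data(p_jitter)
identical(jitter_a$x, jitter_b$x)
```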

---

## 🧠 Key Takeaways

- Position adjustments change *how* data is displayed, not *what* data is shown
- Bars default to stacking; points default to identity
- Jitter is a powerful fix for overplotting
- Choosing the right adjustment can dramatically improve interpretability

Understanding position adjustments helps you control clutter, reveal structure, and tell clearer stories with your data.


In [None]:
library(tidyverse)

# Color vs fill in bar charts
ggplot(mpg, aes(x = drv, color = drv)) +
  geom_bar()

ggplot(mpg, aes(x = drv, fill = drv)) +
  geom_bar()

# Stacked bars (default)
ggplot(mpg, aes(x = drv, fill = class)) +
  geom_bar()

# Identity position with transparency / outlines
ggplot(mpg, aes(x = drv, fill = class)) +
  geom_bar(alpha = 1/5, position = "identity")

ggplot(mpg, aes(x = drv, color = class)) +
  geom_bar(fill = NA, position = "identity")

# Fill and dodge
ggplot(mpg, aes(x = drv, fill = class)) +
  geom_bar(position = "fill")

ggplot(mpg, aes(x = drv, fill = class)) +
  geom_bar(position = "dodge")

# Overplotting vs jitter
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")

# Shorthand
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_jitter()

# Default boxplot position adjustment (geom_boxplot() uses position = "dodge2")
ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()


# 🧭 **9.7 Coordinate Systems**

Coordinate systems control **how x and y positions are interpreted and drawn** in ggplot2. While they don’t change the data itself, they can dramatically change how the data is perceived.

---

## 📐 Cartesian Coordinates (Default)

By default, ggplot2 uses **Cartesian coordinates**, where x and y act independently. This works well for most plots like scatterplots, bar charts, and line graphs.

---

## 🗺️ `coord_quickmap()`: For Maps

When plotting geographic data, using the default Cartesian system can **distort distances and shapes**.  
`coord_quickmap()` sets the aspect ratio so that longitude and latitude are displayed in sensible proportion.

- Essential for spatial data
- Approximately preserves relative distances (a quick approximation, not a full map projection)
- Faster and simpler than `coord_map()`

---

## 🧿 `coord_polar()`: Polar Coordinates

`coord_polar()` converts Cartesian coordinates into **polar coordinates**, where:
- x becomes the **angle**
- y becomes the **radius**

This reveals the relationship between:
- **Bar charts**
- **Coxcomb charts**
- **Pie charts** (stacked bars + polar coordinates)

Simply changing the coordinate system can transform the entire interpretation of a plot.
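
The pie chart case can be sketched in a few lines. The example below builds a single stacked bar of `drv` counts and then maps the stacked heights to the angle with `coord_polar(theta = "y")`; using `factor(1)` as a dummy x value is a convenience assumed for this illustration.

```r
library(ggplot2)

# A single stacked bar: one x position, segments filled by drive train.
stacked <- ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1)

# Mapping y (the stacked counts) to the angle turns the stack into a pie.
pie <- stacked + coord_polar(theta = "y")

# For comparison, one bar per drv with the default theta = "x" gives a
# Coxcomb chart, where the count sets each wedge's radius.
coxcomb <- ggplot(mpg, aes(x = drv, fill = drv)) +
  geom_bar(width = 1) +
  coord_polar()
```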

---

## 🔄 Coordinate Transformations Matter

- Coordinates affect **geometry, angles, and slopes**
- They are applied **after** statistical transformations
- Used for maps, circular charts, and enforcing equal scaling

Understanding coordinate systems helps you choose visuals that best reflect the structure of your data.


In [None]:
library(tidyverse)
library(maps)

# Geographic example
nz <- map_data("nz")

ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black")

ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()

# Bar chart transformed with coordinates
bar <- ggplot(diamonds) +
  geom_bar(
    aes(x = clarity, fill = clarity),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1)

bar + coord_flip()
bar + coord_polar()

# City vs highway mpg with fixed coordinates
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()
