
# 🧩 16.1 Introduction: Factors in R

**Factors** are R's way of working with **categorical variables**: variables that have a fixed, known set of possible values (like gender, education level, or survey responses). They're especially important in data analysis because they:

- Preserve the *set* of allowed values (levels)
- Control the *ordering* of categories (not just alphabetical)
- Interact correctly with modeling and plotting functions
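
As a minimal base-R sketch of the first two points (the `sizes` vector is made up for illustration), a factor remembers allowed values that never occur in the data and keeps them in the order you declared:

```r
# Hypothetical survey answers; "large" is allowed but never observed
sizes <- factor(
  c("small", "medium", "small"),
  levels = c("small", "medium", "large")
)

levels(sizes)   # declared order, not alphabetical
table(sizes)    # "large" is still reported, with a count of 0

# A plain character vector forgets the unused value entirely
table(as.character(sizes))
```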

In this chapter, you'll:
- Learn **why factors matter** (and when characters aren't enough)
- Create factors with `factor()`
- Work with real categorical data using the **`gss_cat`** dataset
- Reorder, relabel, and collapse factor levels
- Understand **ordered factors**, where category order has meaning

---

## 📦 Prerequisites

Base R includes basic factor tools, but we'll mostly use **forcats**, a tidyverse package designed specifically for factor manipulation (and yes, it's an anagram of *factors* 🐱).

`forcats` makes it much easier to:
- Reorder levels
- Rename or lump categories
- Handle missing or rare values

We'll load it via the tidyverse.


In [None]:
library(tidyverse)


## 16.2 Factor basics

Using **strings** to represent categorical data (like months) causes two common problems:

1. **Invalid values** can sneak in (e.g. `"Jam"` instead of `"Jan"`).
2. **Sorting is unhelpful**, because strings sort alphabetically, not logically.

**Factors** solve both issues by:
- Restricting values to a predefined set of **levels**
- Enforcing a meaningful **order**

### Key ideas
- `factor(x, levels = ...)` creates a factor with an explicit level order
- Values not listed in `levels` become `NA`
- `forcats::fct()` is safer than `factor()` because it errors on invalid values
- If you don't specify levels:
  - `factor()` sorts alphabetically
  - `fct()` orders levels by **first appearance**
- Use `levels()` to inspect valid values
- You can define factors at **data import time** with `readr::col_factor()`


In [None]:
library(tidyverse)

# Character vector of months
x1 <- c("Dec", "Apr", "Jan", "Mar")
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Explicit month order
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

# Create factors
y1 <- factor(x1, levels = month_levels)
sort(y1)

# Invalid values silently become NA
y2 <- factor(x2, levels = month_levels)
y2

# Safer alternative: fct() errors on values missing from `levels`
# (wrapped in try() so the rest of the cell still runs)
try(fct(x2, levels = month_levels))

# Default behavior
factor(x1)  # alphabetical
fct(x1)     # order of appearance

# Inspect levels
levels(y1)

# Create factors during data import
csv <- "
month,value
Jan,12
Feb,56
Mar,12
"

df <- read_csv(
  I(csv),
  col_types = cols(month = col_factor(month_levels))
)

df$month


## 16.3 General Social Survey (gss_cat)

For the rest of the chapter, we use **`forcats::gss_cat`**, a sample from the US General Social Survey. It contains several **factor variables** that are perfect for practicing common factor tasks.

### Key ideas
- Factors inside a tibble don't show their levels directly
- `count()` is an easy way to explore factor distributions
- Two very common factor tasks:
  1. **Reordering levels** (often for better plots)
  2. **Inspecting or collapsing levels** to understand categories
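
One way to see the levels that a tibble print hides, sketched here with the `race` column of `gss_cat` (the column choice is only for illustration; any factor column would do): `levels()` lists every allowed value, and `count()` with `.drop = FALSE` keeps levels that have no rows.

```r
library(dplyr)
library(forcats)  # provides gss_cat

# All allowed values, including any that never occur in the data
levels(gss_cat$race)

# Default: only levels that actually appear
gss_cat |> count(race)

# .drop = FALSE keeps zero-count levels in the result
race_counts <- gss_cat |> count(race, .drop = FALSE)
race_counts
```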

### What to look for in the exercises
- **`rincome`** has many long, uneven labels → default bar charts are cluttered  
  → improve by reordering by frequency and flipping coordinates
- **Most common categories** can be found with `count(..., sort = TRUE)`
- **`denom` only applies to some religions** (Protestant, in this sample; other respondents mostly have non-answer levels such as `"Not applicable"`)  
  → check with a table or a faceted/filtered visualization


In [None]:
library(tidyverse)
library(forcats)

# View dataset
gss_cat

# 1. Distribution of reported income
gss_cat |>
  count(rincome) |>
  mutate(rincome = fct_reorder(rincome, n)) |>
  ggplot(aes(x = rincome, y = n)) +
  geom_col() +
  coord_flip()

# 2. Most common religion
gss_cat |>
  count(relig, sort = TRUE)

# 3. Most common political party ID
gss_cat |>
  count(partyid, sort = TRUE)

# 4. Which religion does denomination apply to? (table)
gss_cat |>
  count(relig, denom)

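# 5. Visualization: denomination by religion
# Non-answers in denom are ordinary levels (not NA), so drop them by name
gss_cat |>
  filter(!denom %in% c(
    "No answer", "Don't know", "No denomination", "Other", "Not applicable"
  )) |>
  ggplot(aes(x = denom)) +
  geom_bar() +
  coord_flip() +
  facet_wrap(~ relig)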


## 16.4 Modifying factor order

This section is all about **reordering factor levels to make plots readable**, without breaking the meaning of the data.

### Core ideas
- Use **`fct_reorder()`** when factor levels have **no natural order** (e.g., religion, marital status).
- **Do NOT reorder factors with a principled order** (e.g., income ranges, education levels).
- Use **`fct_relevel()`** to manually move special values (like "Not applicable") without scrambling meaning.
- For line plots, **`fct_reorder2()`** aligns legend order with the lines at the right edge.
- For bar plots, **`fct_infreq()`** orders levels by frequency (very common + very useful).
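
A toy sketch of what each helper does to `levels()`, using made-up vectors (`f`, `score`, `grp`, `x`, and `y` are illustrative names, not columns of `gss_cat`):

```r
library(forcats)

f     <- factor(c("b", "a", "c", "a", "c", "c"))  # default levels: a, b, c
score <- c(10, 1, 5, 3, 6, 4)

# fct_reorder(): sort levels by a summary of another variable (median by default)
# medians here: a = 2, b = 10, c = 5
by_score <- fct_reorder(f, score)
levels(by_score)      # "a" "c" "b"

# fct_relevel(): move named levels to the front, leave the rest alone
c_first <- fct_relevel(f, "c")
levels(c_first)       # "c" "a" "b"

# fct_infreq(): most frequent level first (c = 3, a = 2, b = 1)
by_freq <- fct_infreq(f)
levels(by_freq)       # "c" "a" "b"

# fct_reorder2(): order by the y value at the largest x, highest first,
# so the legend matches the right-hand end of each line
grp <- factor(c("a", "a", "b", "b"))
x   <- c(1, 2, 1, 2)
y   <- c(5, 2, 3, 4)  # at x = 2: a ends at 2, b ends at 4
by_last <- fct_reorder2(grp, x, y)
levels(by_last)       # "b" "a"
```

None of these calls change the underlying values; only the order of the levels differs.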

### Exercise answers (conceptual)
- **Suspiciously high `tvhours` values**:  
  The mean is *not* a great summary because a few extreme values pull it upward. Median would be more robust.
- **Arbitrary vs principled factor order**:
  - Arbitrary: `relig`, `marital`, `denom`
  - Principled: `rincome` (income ranges have meaning) and `partyid` (its party levels run from "Strong republican" to "Strong democrat")
- **Why "Not applicable" moves to the bottom**:  
  In ggplot, the **first factor level appears at the bottom** of a vertical axis. Moving it "to the front" means moving it to the bottom visually.

Takeaway: reorder for **clarity**, not convenience.


In [None]:
library(tidyverse)
library(forcats)

# 1. Is mean tvhours a good summary?
gss_cat |>
  summarize(
    mean_tv = mean(tvhours, na.rm = TRUE),
    median_tv = median(tvhours, na.rm = TRUE),
    max_tv = max(tvhours, na.rm = TRUE)
  )

# 2. Reordering religion by average TV hours (arbitrary order → OK)
relig_summary <- gss_cat |>
  group_by(relig) |>
  summarize(
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

relig_summary |>
  mutate(relig = fct_reorder(relig, tvhours)) |>
  ggplot(aes(x = tvhours, y = relig)) +
  geom_point()

# 3. Income has a principled order → DON'T reorder numerically
rincome_summary <- gss_cat |>
  group_by(rincome) |>
  summarize(age = mean(age, na.rm = TRUE))

ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, "Not applicable"))) +
  geom_point()

# 4. Line plot with reordered legend using fct_reorder2()
by_age <- gss_cat |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  group_by(age) |>
  mutate(prop = n / sum(n))

ggplot(
  by_age,
  aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))
) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(color = "marital")

# 5. Bar plot ordered by increasing frequency
# (fct_infreq() sorts most frequent first; fct_rev() reverses that)
gss_cat |>
  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
  ggplot(aes(x = marital)) +
  geom_bar()


## 16.5 Modifying factor levels

This section focuses on **changing the values of factor levels**, which is even more powerful than reordering them. The main goals are to make labels clearer, combine categories sensibly, and simplify plots or summaries.

### Core tools
- **`fct_recode()`**: rename factor levels (new name on the left, old name on the right). Unmentioned levels are left unchanged.
- **`fct_collapse()`**: collapse many old levels into fewer, broader categories.
- **`fct_lump_*()`** family: automatically lump small groups into `"Other"` based on frequency, count, or proportion.
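
A small sketch of the three tools on a made-up `fruit` factor (the data and the `rare` label are hypothetical, chosen only to keep the output easy to read):

```r
library(forcats)

fruit <- factor(c("apple", "apple", "apple", "kiwi", "kiwi", "fig", "plum"))

# fct_recode(): new name = old name; unmentioned levels are untouched
renamed <- fct_recode(fruit, "Apple" = "apple")
levels(renamed)

# fct_collapse(): several old levels become one new level
collapsed <- fct_collapse(fruit, rare = c("fig", "plum"))
table(collapsed)

# fct_lump_n(): keep the n most frequent levels, lump the rest into "Other"
lumped <- fct_lump_n(fruit, n = 2)
table(lumped)
```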

### Exercise answers (conceptual)
1. **Party identification over time**  
   The proportions of Democrats, Republicans, and Independents shift gradually across years, with Independents generally growing over time and Democrats/Republicans fluctuating. A grouped summary by `year` and collapsed `partyid` is the right approach.

2. **Collapsing `rincome`**  
   You could collapse income into broader bins such as *Low*, *Middle*, *High*, and *Not applicable* using `fct_collapse()`, grouping adjacent income ranges together.

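3. **Why only 9 groups (plus `"Other"`) from `fct_lump_n(relig, n = 10)`?**  
   `relig` already has a level named `"Other"`, and it is among the 10 most frequent levels. The lumped levels are also named `"Other"` by default (`other_level = "Other"`), so they merge into that existing level. The result is 9 named religions plus one combined `"Other"`.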
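
The `fct_lump_n()` behavior can be reproduced on a made-up factor that already contains a level called `"Other"` (the data below is hypothetical). This sketch shows the lumped levels merging into the existing `"Other"` rather than creating an extra group:

```r
library(forcats)

# Made-up data with a pre-existing "Other" level
f <- factor(c("A", "A", "A", "A", "B", "B", "B", "Other", "Other", "C", "D"))

# The 3 most frequent levels are A, B, and the original "Other"
lumped <- fct_lump_n(f, n = 3)

# C and D are lumped into other_level = "Other", which merges with the
# original "Other": 3 levels remain, not 3 + 1
levels(lumped)
table(lumped)
```

Passing a different `other_level` (for example `other_level = "Lumped"`) keeps the two groups separate.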


In [None]:
library(tidyverse)
library(forcats)

# 1. Party identification over time (collapsed)
gss_cat |>
  mutate(
    party = fct_collapse(
      partyid,
      rep = c("Strong republican", "Not str republican"),
      dem = c("Strong democrat", "Not str democrat"),
      ind = c("Ind,near rep", "Ind,near dem", "Independent"),
      other = c("No answer", "Don't know", "Other party")
    )
  ) |>
  group_by(year, party) |>
  summarize(n = n(), .groups = "drop") |>
  group_by(year) |>
  mutate(prop = n / sum(n))

# 2. Collapse rincome into broader categories
gss_cat |>
  mutate(
    rincome_simple = fct_collapse(
      rincome,
      # Level names must match levels(gss_cat$rincome) exactly
      low = c("Lt $1000", "$1000 to 2999", "$3000 to 3999", "$4000 to 4999"),
      middle = c("$5000 to 5999", "$6000 to 6999", "$7000 to 7999", "$8000 to 9999"),
      high = c("$10000 - 14999", "$15000 - 19999", "$20000 - 24999", "$25000 or more"),
      other = c("No answer", "Don't know", "Refused", "Not applicable")
    )
  ) |>
  count(rincome_simple)

# 3. fct_lump_n(n = 10) keeps 10 levels, but relig's existing "Other" level
#    merges with the lumped "Other", leaving 9 named religions + Other
gss_cat |>
  mutate(relig = fct_lump_n(relig, n = 10)) |>
  count(relig)


## 16.6 Ordered factors

**Ordered factors** represent categorical variables where the levels have a clear ranking, but the *distance* between levels is unknown. You create them with `ordered()`. When printed, their levels are shown with `<` to emphasize the ranking.

### When ordered factors behave differently
1. **ggplot2 aesthetics**  
   Mapping an ordered factor to `color` or `fill` defaults to a *sequential* palette (`viridis`), which visually implies order.
2. **Modeling**  
   In linear models, ordered factors use **polynomial contrasts** by default. These encode trends across the ordered levels (useful in some fields, but often not interpreted directly).

### Practical guidance
- Use ordered factors when categories are *ranked* (e.g., *low < medium < high*).
- If order is arbitrary, stick with regular factors.
- In many tidyverse workflows, ordered vs. unordered won't change much, but some domains (especially social sciences) rely on this distinction for correct analysis behavior.
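
A base-R sketch of how the ranking shows up in practice (the `sizes` vector is a made-up example): comparisons follow the level order, and the default contrasts differ between ordered and unordered factors.

```r
# Base R only; `sizes` is a hypothetical example vector
sizes <- ordered(
  c("low", "high", "medium"),
  levels = c("low", "medium", "high")
)

sizes < "high"   # comparisons follow the level order
max(sizes)       # defined for ordered factors only

# Default contrasts: polynomial (.L, .Q) for ordered, treatment for unordered
colnames(contrasts(sizes))
colnames(contrasts(factor(sizes, ordered = FALSE)))
```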


In [None]:
# Create an ordered factor
ord <- ordered(c("low", "medium", "high"),
               levels = c("low", "medium", "high"))
ord

# ggplot2 uses a sequential (viridis) palette for ordered factors
library(tidyverse)  # tibble() comes from tibble, not ggplot2
df <- tibble(level = ord, value = c(1, 2, 3))
ggplot(df, aes(x = value, y = level, color = level)) +
  geom_point(size = 3)

# Ordered factors in a linear model (polynomial contrasts: .L and .Q terms)
lm(value ~ level, data = df)
